Computation and Language 73
☆ A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Xiawu Zheng, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Xing Sun, Rongrong Ji
The surge of interest towards Multi-modal Large Language Models (MLLMs),
e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both
academia and industry. They endow Large Language Models (LLMs) with powerful
capabilities in visual understanding, enabling them to tackle diverse
multi-modal tasks. Very recently, Google released Gemini, its newest and most
capable MLLM built from the ground up for multi-modality. In light of the
superior reasoning capabilities, can Gemini challenge GPT-4V's leading position
in multi-modal learning? In this paper, we present a preliminary exploration of
Gemini Pro's visual understanding proficiency, which comprehensively covers
four domains: fundamental perception, advanced cognition, challenging vision
tasks, and various expert capacities. We compare Gemini Pro with the
state-of-the-art GPT-4V to evaluate its upper limits, along with the latest
open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and
black-box systems. The qualitative samples indicate that, while GPT-4V and
Gemini showcase different answering styles and preferences, they can exhibit
comparable visual reasoning capabilities, and Sphinx still trails behind them
concerning domain generalizability. Specifically, GPT-4V tends to elaborate
detailed explanations and intermediate steps, and Gemini prefers to output a
direct and concise answer. The quantitative evaluation on the popular MME
benchmark also demonstrates the potential of Gemini to be a strong challenger
to GPT-4V. Our early investigation of Gemini also observes some common issues
of MLLMs, indicating that there still remains a considerable distance towards
artificial general intelligence. Our project for tracking the progress of MLLM
is released at
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
comment: Total 120 pages. See our project at
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
☆ Efficient Title Reranker for Fast and Improved Knowledge-Intense NLP
We introduce the Efficient Title Reranker (ETR) via Broadcasting Query Encoder,
a novel title reranking technique that runs 20x-40x faster than a vanilla
passage reranker. However, training the Efficient Title Reranker is unstable.
Analyzing the issue, we found that some very difficult ground truths may act as
noisy labels, causing accuracy to drop, and that some extreme values in the
model's probability output cause NaN. To address these issues, we introduce the
Sigmoid Trick, a novel technique that reduces the gradient update in both
cases, resulting in better retrieval efficacy. Experiments show the
effectiveness of ETR and the Sigmoid Trick: we achieve four state-of-the-art
positions on the KILT knowledge benchmark.
☆ SpokesBiz -- an Open Corpus of Conversational Polish
Piotr Pęzik, Sylwia Karasińska, Anna Cichosz, Łukasz Jałowiecki, Konrad Kaczyński, Małgorzata Krawentek, Karolina Walkusz, Paweł Wilk, Mariusz Kleć, Krzysztof Szklanny, Szymon Marszałkowski
This paper announces the early release of SpokesBiz, a freely available
corpus of conversational Polish developed within the CLARIN-BIZ project and
comprising over 650 hours of recordings. The transcribed recordings have been
diarized and manually annotated for punctuation and casing. We outline the
general structure and content of the corpus, showcasing selected applications
in linguistic research, evaluation and improvement of automatic speech
recognition (ASR) systems.
☆ Avoiding Data Contamination in Language Model Evaluation: Dynamic Test Construction with Latest Materials AAAI 2024
Data contamination in evaluation is becoming increasingly prevalent with the
emergence of language models pre-trained on very large, automatically crawled
corpora. This problem poses significant challenges for the accurate assessment
of model capabilities and generalisation. In this paper, we propose LatestEval,
an automatic method that leverages the most recent texts to create uncontaminated
reading comprehension evaluations. LatestEval avoids data contamination by only
using texts published within a recent time window, ensuring no overlap with the
training corpora of pre-trained language models. We develop the LatestEval
automated pipeline to 1) gather the latest texts; 2) identify key information;
and 3) construct questions targeting the information while removing the existing
answers from the context. This encourages models to infer the answers
themselves based on the remaining context, rather than simply copying them. Our
experiments demonstrate that language models exhibit negligible memorisation
behaviours on LatestEval as opposed to previous benchmarks, suggesting a
significantly reduced risk of data contamination and leading to a more robust
evaluation. Data and code are publicly available at:
https://github.com/liyucheng09/LatestEval.
comment: AAAI 2024
☆ PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to
interpret the complex nature of human sentiments. Despite significant progress
in multimodal architecture design, the field lacks comprehensive regularization
methods. This paper introduces PowMix, a versatile embedding space regularizer
that builds upon the strengths of unimodal mixing-based regularization
approaches and introduces novel algorithmic components that are specifically
tailored to multimodal tasks. PowMix is integrated before the fusion stage of
multimodal architectures and facilitates intra-modal mixing, such as mixing
text with text, to act as a regularizer. PowMix consists of five components: 1)
a varying number of generated mixed examples, 2) mixing factor reweighting, 3)
anisotropic mixing, 4) dynamic mixing, and 5) cross-modal label mixing.
Extensive experimentation across benchmark MSA datasets and a broad spectrum of
diverse architectural designs demonstrate the efficacy of PowMix, as evidenced
by consistent performance improvements over baselines and existing mixing
methods. An in-depth ablation study highlights the critical contribution of
each PowMix component and how they synergistically enhance performance.
Furthermore, algorithmic analysis demonstrates how PowMix behaves in different
scenarios, particularly comparing early versus late fusion architectures.
Notably, PowMix enhances overall performance without sacrificing model
robustness or magnifying text dominance. It also retains its strong performance
in situations of limited data. Our findings position PowMix as a promising
versatile regularization strategy for MSA. Code will be made available.
comment: Preprint
☆ Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
With the recent surge in popularity of LLMs has come an ever-increasing need
for LLM safety training. In this paper, we show that SOTA open-source LLMs are
vulnerable to simple, optimization-free attacks we refer to as priming
attacks, which are easy to execute and effectively bypass alignment from
safety training. Our proposed attack improves the Attack Success Rate on
Harmful Behaviors, as measured by Llama Guard, by up to 3.3x compared to
baselines. Source code and data are available at
https://github.com/uiuc-focal-lab/llm-priming-attacks .
☆ Instruct-SCTG: Guiding Sequential Controlled Text Generation through Instructions
Instruction-tuned large language models have shown remarkable performance in
aligning generated text with user intentions across various tasks. However,
maintaining human-like discourse structure in the generated text remains a
challenging research question. In this paper, we propose Instruct-SCTG, a
flexible and effective sequential framework that harnesses instruction-tuned
language models to generate structurally coherent text in both fine-tuned and
zero-shot setups. Our framework generates articles in a section-by-section
manner, aligned with the desired human structure using natural language
instructions. Furthermore, we introduce a new automatic metric that measures
discourse divergence in a fuzzy manner. Extensive experiments on three datasets
from representative domains of news and recipes demonstrate the
state-of-the-art performance of our framework in imposing discourse structure
during text generation, as verified by both automatic and human evaluation. Our
code will be available on GitHub.
☆ Automated speech audiometry: Can it work using open-source pre-trained Kaldi-NL automatic speech recognition?
A practical speech audiometry tool is the digits-in-noise (DIN) test for
hearing screening of populations of varying ages and hearing status. The test
is usually conducted by a human supervisor (e.g., clinician), who scores the
responses spoken by the listener, or online, where software scores the
responses entered by the listener. The test has 24 digit-triplets presented in
an adaptive staircase procedure, resulting in a speech reception threshold
(SRT). We propose an alternative automated DIN test setup that can evaluate
spoken responses whilst conducted without a human supervisor, using the
open-source automatic speech recognition toolkit, Kaldi-NL. Thirty
self-reported normal-hearing Dutch adults (19-64 years) completed one
DIN+Kaldi-NL test. Their spoken responses were recorded, and used for
evaluating the transcript of decoded responses by Kaldi-NL. Study 1 evaluated
the Kaldi-NL performance through its word error rate (WER), defined here as the
percentage of summed decoding errors, counting only digits found in the
transcript, relative to the total number of digits present in the spoken
responses. Average WER
across participants was 5.0% (range 0 - 48%, SD = 8.8%), with average decoding
errors in three triplets per participant. Study 2 analysed the effect that
triplets with decoding errors from Kaldi-NL had on the DIN test output (SRT),
using bootstrapping simulations. Previous research indicated 0.70 dB as the
typical within-subject SRT variability for normal-hearing adults. Study 2
showed that up to four triplets with decoding errors produce SRT variations
within this range, suggesting that our proposed setup could be feasible for
clinical applications.
comment: 25 pages (double spaced), 5 figures, 3 tables, 54 references
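The digit-level WER described in the abstract above can be illustrated with a minimal sketch. It assumes that decoding errors are counted as the edit distance between the spoken and decoded digit sequences of each triplet; the abstract does not state the exact counting rule, and the function names `edit_distance` and `digit_wer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def digit_wer(spoken, decoded):
    """Percentage of digit errors over all spoken digits (assumed definition)."""
    total_errors = sum(edit_distance(s, d) for s, d in zip(spoken, decoded))
    total_digits = sum(len(s) for s in spoken)
    return 100.0 * total_errors / total_digits


# Illustrative data only: two triplets, one digit decoded incorrectly.
spoken = [[1, 2, 3], [4, 5, 6]]
decoded = [[1, 2, 3], [4, 5, 9]]
wer = digit_wer(spoken, decoded)
```

Summing errors over all triplets before dividing, rather than averaging per-triplet rates, matches the abstract's phrase "summed decoding errors ... compared to the total number of digits".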
☆ Geo-located Aspect Based Sentiment Analysis (ABSA) for Crowdsourced Evaluation of Urban Environments
Sentiment analysis methods are rapidly being adopted by the field of Urban
Design and Planning, for the crowdsourced evaluation of urban environments.
However, most models used within this domain are only able to identify positive or
negative sentiment associated with a textual appraisal as a whole, without
inferring information about specific urban aspects contained within it, or the
sentiment associated with them. While Aspect Based Sentiment Analysis (ABSA) is
becoming increasingly popular, most existing ABSA models are trained on
non-urban themes such as restaurants, electronics, consumer goods and the like.
This body of research develops an ABSA model capable of extracting urban
aspects contained within geo-located textual urban appraisals, along with
corresponding aspect sentiment classification. We annotate a dataset of 2500
crowdsourced reviews of public parks, and train a Bidirectional Encoder
Representations from Transformers (BERT) model with Local Context Focus (LCF)
on this data. Our model achieves significant improvement in prediction accuracy
on urban reviews, for both Aspect Term Extraction (ATE) and Aspect Sentiment
Classification (ASC) tasks. For demonstrative analysis, positive and negative
urban aspects across Boston are spatially visualized. We hope that this model
is useful for designers and planners for fine-grained urban sentiment
evaluation.
comment: Created for 6.8610, Quantitative Methods for Natural Language
Processing at MIT Fall 2022. 5 pages, 4 figures
☆ GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning
Large language models have shown impressive results for multi-hop
mathematical reasoning when the input question is only textual. Many
mathematical reasoning problems, however, contain both text and image. With the
ever-increasing adoption of vision language models (VLMs), understanding their
reasoning abilities for such problems is crucial. In this paper, we evaluate
the reasoning capabilities of VLMs along various axes through the lens of
geometry problems. We procedurally create a synthetic dataset of geometry
questions with controllable difficulty levels along multiple axes, thus
enabling a systematic evaluation. The empirical results obtained using our
benchmark for state-of-the-art VLMs indicate that these models are not as
capable in subjects like geometry (and, by generalization, other topics
requiring similar reasoning) as suggested by previous benchmarks. This is made
especially clear by the construction of our benchmark at various depth levels,
since solving higher-depth problems requires long chains of reasoning rather
than additional memorized knowledge. We release the dataset for further
research in this area.
☆ Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment
With the continuous growth in the number of parameters of transformer-based
pretrained language models (PLMs), particularly the emergence of large language
models (LLMs) with billions of parameters, many natural language processing
(NLP) tasks have demonstrated remarkable success. However, the enormous size
and computational demands of these models pose significant challenges for
adapting them to specific downstream tasks, especially in environments with
limited computational resources. Parameter Efficient Fine-Tuning (PEFT) offers
an effective solution by reducing the number of fine-tuning parameters and
memory usage while achieving comparable performance to full fine-tuning. The
demands for fine-tuning PLMs, especially LLMs, have led to a surge in the
development of PEFT methods, as depicted in Fig. 1. In this paper, we present a
comprehensive and systematic review of PEFT methods for PLMs. We summarize
these PEFT methods, discuss their applications, and outline future directions.
Furthermore, we conduct experiments using several representative PEFT methods
to better understand their effectiveness in parameter efficiency and memory
efficiency. By offering insights into the latest advancements and practical
applications, this survey serves as an invaluable resource for researchers and
practitioners seeking to navigate the challenges and opportunities presented by
PEFT in the context of PLMs.
comment: 20 pages, 4 figures
☆ Exploring the Residual Stream of Transformers
Transformer-based models have achieved great breakthroughs in recent years.
However, many significant questions about why these models produce such
powerful outputs remain unanswered. We do
not know how to locate the models' important parameters storing the knowledge
for predicting the next word, and whether these parameters are stored on the
same layer/module or different ones. Moreover, we do not understand the
mechanism to merge the knowledge into the final embedding for next word
prediction. In this paper, we explore the residual stream of transformers to
increase the interpretability. We find the mechanism behind residual connection
is a direct addition function on before-softmax values, so the probabilities of
tokens with larger before-softmax values will increase. Moreover, we prove that
using log probability increase as contribution scores is reasonable, and based
on this we can locate important parameters. Besides, we propose a method to
analyze how previous layers affect upper layers by comparing the inner
products. The experimental results and case study show that our research can
increase the interpretability of transformer-based models. We will release our
code on https://github.com/zepingyu0512/residualstream.
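The idea of using the log probability increase as a contribution score can be sketched as follows. This is a minimal illustration that assumes a layer's contribution is added directly to the before-softmax values (logits); the function names and the toy numbers are hypothetical, and the paper's exact definition may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def log_prob_increase(base_logits, update_logits, token):
    """Change in log p(token) when update_logits is added to base_logits."""
    before = log_softmax(base_logits)[token]
    merged = [b + u for b, u in zip(base_logits, update_logits)]
    after = log_softmax(merged)[token]
    return after - before


# Toy example: the update raises the logit of token 0 only.
base = [0.0, 0.0, 0.0]
update = [2.0, 0.0, 0.0]
gain_token0 = log_prob_increase(base, update, token=0)
gain_token1 = log_prob_increase(base, update, token=1)
```

In this toy case the score for token 0 is positive and the score for token 1 is negative, which mirrors the abstract's claim that tokens with larger before-softmax values gain probability.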
☆ Knowledge Graph Error Detection with Contrastive Confidence Adaption AAAI
Knowledge graphs (KGs) often contain various errors. Previous works on
detecting errors in KGs mainly rely on triplet embedding from graph structure.
We conduct an empirical study and find that these works struggle to
discriminate noise from semantically-similar correct triplets. In this paper,
we propose a KG error detection model CCA to integrate both textual and graph
structural information from triplet reconstruction for better distinguishing
semantics. We design interactive contrastive learning to capture the
differences between textual and structural patterns. Furthermore, we construct
realistic datasets with semantically-similar noise and adversarial noise.
Experimental results demonstrate that CCA outperforms state-of-the-art
baselines, especially in detecting semantically-similar noise and adversarial
noise.
comment: Accepted in the 38th AAAI Conference on Artificial Intelligence (AAAI
2024)
☆ Founder-GPT: Self-play to evaluate the Founder-Idea fit
This research introduces an innovative evaluation method for the
"founder-idea" fit in early-stage startups, utilizing advanced large language
model techniques to assess founders' profiles against their startup ideas to
enhance decision-making. Embeddings, self-play, tree-of-thought, and
critique-based refinement techniques show promising early results, suggesting
that each idea's success patterns are unique and should be evaluated in the
context of the founder's background.
☆ Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction
Few-shot Relation Extraction (FSRE) aims to extract relational facts from a
sparse set of labeled corpora. Recent studies have shown promising results in
FSRE by employing Pre-trained Language Models (PLMs) within the framework of
supervised contrastive learning, which considers both instances and label
facts. However, how to effectively harness massive instance-label pairs to
enrich the learned representation semantically in this learning
paradigm is not fully explored. To address this gap, we introduce a novel
synergistic anchored contrastive pre-training framework. This framework is
motivated by the insight that the diverse viewpoints conveyed through
instance-label pairs capture incomplete yet complementary intrinsic textual
semantics. Specifically, our framework involves a symmetrical contrastive
objective that encompasses both sentence-anchored and label-anchored
contrastive losses. By combining these two losses, the model establishes a
robust and uniform representation space. This space effectively captures the
reciprocal alignment of feature distributions among instances and relational
facts, simultaneously enhancing the maximization of mutual information across
diverse perspectives within the same relation. Experimental results demonstrate
that our framework achieves significant performance enhancements compared to
baseline models in downstream FSRE tasks. Furthermore, our approach exhibits
superior adaptability to handle the challenges of domain shift and zero-shot
relation extraction. Our code is available online at
https://github.com/AONE-NLP/FSRE-SaCon.
☆ Active Preference Inference using Language Models and Probabilistic Reasoning
Actively inferring user preferences, for example by asking good questions, is
important for any human-facing decision-making system. Active inference allows
such systems to adapt and personalize themselves to nuanced individual
preferences. To enable this ability for instruction-tuned large language models
(LLMs), one may prompt them to ask users questions to infer their preferences,
transforming the language models into more robust, interactive systems.
However, out of the box, these models are not efficient at extracting
preferences: the questions they generate are not informative, requiring a high
number of user interactions and impeding the usability of the downstream
system. In this work, we introduce an inference-time algorithm that helps LLMs
quickly infer preferences by using more informative questions. Our algorithm
uses a probabilistic model whose conditional distributions are defined by
prompting an LLM, and returns questions that optimize expected entropy and
expected model change. Results in a simplified interactive web shopping setting
with real product items show that an LLM equipped with our entropy reduction
algorithm outperforms baselines with the same underlying LLM on task
performance while using fewer user interactions.
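Selecting a question by expected entropy can be sketched as below. This is a minimal sketch under stated assumptions: the preference hypotheses are a small discrete set, and the answer likelihoods are hard-coded tables, whereas the abstract says the conditional distributions are defined by prompting an LLM. All names are hypothetical, and the expected model change criterion is not covered.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_posterior_entropy(prior, likelihood):
    """Expected entropy over hypotheses after observing the answer.

    likelihood[h][a] is the assumed P(answer a | hypothesis h).
    """
    total = 0.0
    for a in range(len(likelihood[0])):
        joint = [prior[h] * likelihood[h][a] for h in range(len(prior))]
        p_a = sum(joint)
        if p_a == 0:
            continue  # this answer can never occur
        posterior = [j / p_a for j in joint]
        total += p_a * entropy(posterior)
    return total


def pick_question(prior, questions):
    """Return the question whose answer is expected to leave least uncertainty."""
    return min(questions,
               key=lambda q: expected_posterior_entropy(prior, questions[q]))


# Two equally likely user preferences and two candidate questions.
prior = [0.5, 0.5]
questions = {
    "informative": [[1.0, 0.0], [0.0, 1.0]],  # answer reveals the preference
    "useless": [[0.5, 0.5], [0.5, 0.5]],      # answer is independent of it
}
best = pick_question(prior, questions)
```

Minimizing expected posterior entropy is equivalent to maximizing expected information gain, so a question whose answers do not depend on the user's preference is never preferred over one that separates the hypotheses.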
☆ Can ChatGPT be Your Personal Medical Assistant?
The advanced large language model (LLM) ChatGPT has shown its potential in
different domains and remains unbeaten due to its characteristics compared to
other LLMs. This study aims to evaluate the potential of using a fine-tuned
ChatGPT model as a personal medical assistant in the Arabic language. To do so,
this study uses publicly available online question-answering datasets in
Arabic. There are almost 430K questions and answers for 20
disease-specific categories. The GPT-3.5-turbo model was fine-tuned with a portion
of this dataset. The performance of this fine-tuned model was evaluated through
automated and human evaluation. The automated evaluations include perplexity,
coherence, similarity, and token count. Native Arabic speakers with medical
knowledge evaluated the generated text by calculating relevance, accuracy,
precision, logic, and originality. The overall result shows that ChatGPT has a
bright future in medical assistance.
comment: 5 pages, 7 figures, two tables, Accepted at the International
Symposium on Foundation and Large Language Models (FLLM2023)
☆ Coreference Graph Guidance for Mind-Map Generation AAAI 2024
Mind-map generation aims to process a document into a hierarchical structure
to show its central idea and branches. Such a structure is more conducive to
understanding the logic and semantics of the document than plain text.
Recently, a state-of-the-art method encodes the sentences of a document
sequentially and converts them to a relation graph via sequence-to-graph.
Though this method is efficient at generating mind-maps in parallel, its
mechanism focuses more on sequential features while hardly capturing structural
information. Moreover, it struggles to model long-range semantic relations.
In this work, we propose a coreference-guided mind-map generation network
(CMGN) to incorporate external structure knowledge. Specifically, we construct
a coreference graph based on the coreference semantic relationship to introduce
the graph structure information. Then we employ a coreference graph encoder to
mine the potential governing relations between sentences. In order to exclude
noise and better utilize the information of the coreference graph, we adopt a
graph enhancement module in a contrastive learning manner. Experimental results
demonstrate that our model outperforms all the existing methods. The case study
further proves that our model can more accurately and concisely reveal the
structure and semantics of a document. Code and data are available at
https://github.com/Cyno2232/CMGN.
comment: 9 pages, 6 figures. Accepted by AAAI 2024
☆ Climate Change from Large Language Models
Climate change presents significant challenges to the global community, and
it is imperative to raise widespread awareness of the climate crisis and
educate users about low-carbon living. Artificial intelligence, particularly
large language models (LLMs), has emerged as a powerful tool in mitigating the
climate crisis, leveraging their extensive knowledge, broad user base, and
natural language interaction capabilities. However, despite the growing body of
research on climate change, there is a lack of comprehensive assessments of
climate crisis knowledge within LLMs. This paper aims to address this gap by
proposing an automatic evaluation framework. We employ a hybrid approach to
data acquisition that combines data synthesis and manual collection to compile
a diverse set of questions related to the climate crisis. These questions cover
various aspects of climate change, including its causes, impacts, mitigation
strategies, and adaptation measures. We then evaluate the model knowledge
through prompt engineering based on the collected questions and generated
answers. We propose a set of comprehensive metrics to evaluate the climate
crisis knowledge, incorporating indicators from 10 different perspectives.
Experimental results show that our method is effective in evaluating the
knowledge of LLMs regarding the climate crisis. We evaluate several
state-of-the-art LLMs and find that their knowledge falls short in terms of
timeliness.
☆ Fluctuation-based Adaptive Structured Pruning for Large Language Models AAAI 2024
Network Pruning is a promising way to address the huge computing resource
demands of the deployment and inference of Large Language Models (LLMs).
Being retraining-free is important for LLM pruning methods. However, almost all of
the existing retraining-free pruning approaches for LLMs focus on unstructured
pruning, which requires specific hardware support for acceleration. In this
paper, we propose a novel retraining-free structured pruning framework for
LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is
hardware-friendly by effectively reducing storage and enhancing inference
speed. For effective structured pruning of LLMs, we highlight three critical
elements that demand the utmost attention: formulating structured importance
metrics, adaptively searching the global compressed model, and implementing
compensation mechanisms to mitigate performance loss. First, FLAP determines
whether the output feature map is easily recoverable when a column of weight is
removed, based on the fluctuation pruning metric. Then it standardizes the
importance scores to adaptively determine the global compressed model
structure. Finally, FLAP adds bias terms to recover the output
feature maps using the baseline values. We thoroughly evaluate our approach on
a variety of language benchmarks. Without any retraining, our method
significantly outperforms the state-of-the-art methods, including LLM-Pruner
and the extension of Wanda in structured pruning. The code is released at
https://github.com/CASIA-IVA-Lab/FLAP.
comment: Accepted to AAAI 2024
☆ Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives
Agent-based modeling and simulation has evolved as a powerful tool for
modeling complex systems, offering insights into emergent behaviors and
interactions among diverse agents. Integrating large language models into
agent-based modeling and simulation presents a promising avenue for enhancing
simulation capabilities. This paper surveys the landscape of utilizing large
language models in agent-based modeling and simulation, examining their
challenges and promising future directions. In this survey, since this is an
interdisciplinary field, we first introduce the background of agent-based
modeling and simulation and large language model-empowered agents. We then
discuss the motivation for applying large language models to agent-based
simulation and systematically analyze the challenges in environment perception,
human alignment, action generation, and evaluation. Most importantly, we
provide a comprehensive overview of the recent works of large language
model-empowered agent-based modeling and simulation in multiple scenarios,
which can be divided into four domains: cyber, physical, social, and hybrid,
covering simulation of both real-world and virtual environments. Finally, since
this area is new and quickly evolving, we discuss the open problems and
promising future directions.
comment: 37 pages
☆ Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling AAAI'2024
Conversational Speech Synthesis (CSS) aims to accurately express an utterance
with the appropriate prosody and emotional inflection within a conversational
setting. While recognising the significance of the CSS task, prior studies have
not thoroughly investigated the emotional expressiveness problems due to the
scarcity of emotional conversational datasets and the difficulty of stateful
emotion modeling. In this paper, we propose a novel emotional CSS model, termed
ECSS, that includes two main components: 1) to enhance emotion understanding,
we introduce a heterogeneous graph-based emotional context modeling mechanism,
which takes the multi-source dialogue history as input to model the dialogue
context and learn the emotion cues from the context; 2) to achieve emotion
rendering, we employ a contrastive learning-based emotion renderer module to
infer the accurate emotion style for the target utterance. To address the issue
of data scarcity, we meticulously create emotional labels in terms of category
and intensity, and annotate additional emotional information on the existing
conversational dataset (DailyTalk). Both objective and subjective evaluations
suggest that our model outperforms the baseline models in understanding and
rendering emotions. These evaluations also underscore the importance of
comprehensive emotional annotations. Code and audio samples can be found at:
https://github.com/walker-hyf/ECSS.
comment: 9 pages, 4 figures, Accepted by AAAI'2024, Code and audio samples:
https://github.com/walker-hyf/ECSS
☆ Multi-Granularity Information Interaction Framework for Incomplete Utterance Rewriting EMNLP2023
Recent approaches in Incomplete Utterance Rewriting (IUR) fail to capture the
source of important words, which is crucial to edit the incomplete utterance,
and introduce words from irrelevant utterances. We propose a novel and
effective multi-task information interaction framework including context
selection, edit matrix construction, and relevance merging to capture the
multi-granularity of semantic information. Benefiting from fetching the
relevant utterance and figuring out the important words, our approach
outperforms existing state-of-the-art models on two benchmark datasets
Restoration-200K and CANAND in this field. Code will be provided at
https://github.com/yanmenxue/QR.
comment: Findings of EMNLP2023 (short)
☆ Relation-Aware Question Answering for Heterogeneous Knowledge Graphs EMNLP2023
Multi-hop Knowledge Base Question Answering (KBQA) aims to find the answer
entity in a knowledge graph (KG), which requires multiple steps of reasoning.
Existing retrieval-based approaches solve this task by concentrating on the
specific relation at different hops and predicting the intermediate entity
within the reasoning path. During the reasoning process of these methods, the
representations of relations are fixed, but the initial relation representation
may not be optimal. We claim they fail to utilize information from head-tail
entities and the semantic connection between relations to enhance the current
relation representation, which undermines the ability to capture information of
relations in KGs. To address this issue, we construct a dual relation
graph where each node denotes a relation in the original KG (the primal
entity graph) and edges are constructed between relations sharing the same head or
tail entities. Then we iteratively do primal entity graph reasoning, dual
relation graph information propagation, and interaction between these two
graphs. In this way, the interaction between entity and relation is enhanced,
and we derive better entity and relation representations. Experiments on two
public datasets, WebQSP and CWQ, show that our approach achieves a significant
performance gain over the prior state-of-the-art. Our code is available on
\url{https://github.com/yanmenxue/RAH-KBQA}.
comment: Findings of EMNLP2023 (Long)
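The dual relation graph described in the abstract above can be illustrated with a minimal sketch. The function name `build_dual_relation_graph`, the toy triples, and the exact edge rule (two relations are linked when some entity is the head of both, or the tail of both) are assumptions made for illustration and may differ from the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_dual_relation_graph(triples):
    """Return the set of undirected edges between relations.

    Assumption: two relations are linked when some entity is the head of
    both, or the tail of both. The paper may use a different rule.
    """
    by_head = defaultdict(set)  # entity -> relations leaving that entity
    by_tail = defaultdict(set)  # entity -> relations entering that entity
    for head, relation, tail in triples:
        by_head[head].add(relation)
        by_tail[tail].add(relation)

    edges = set()
    for groups in (by_head, by_tail):
        for relations in groups.values():
            # Every pair of relations that meet at this entity gets an edge.
            for a, b in combinations(sorted(relations), 2):
                edges.add((a, b))
    return edges


# Hypothetical toy primal entity graph as (head, relation, tail) triples.
toy_triples = [
    ("e1", "r1", "e2"),
    ("e1", "r2", "e3"),
    ("e4", "r3", "e3"),
    ("e5", "r4", "e6"),
]
dual_edges = build_dual_relation_graph(toy_triples)
# r1 and r2 share the head e1; r2 and r3 share the tail e3; r4 stays isolated.
```

Each node of the resulting graph is a relation, so information can be propagated between relations independently of the entity graph.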
☆ External Knowledge Augmented Polyphone Disambiguation Using Large Language Model
One of the key issues in Mandarin Chinese text-to-speech (TTS) systems is
polyphone disambiguation when doing grapheme-to-phoneme (G2P) conversion. In
this paper, we introduce a novel method to solve the problem as a generation
task. Following recent research on large language models (LLMs) and prompt
learning, the proposed method consists of three modules. The retrieval module
incorporates external knowledge, namely a multi-level semantic dictionary of
Chinese polyphonic characters, to format the sentence into a prompt. The
generation module adopts a decoder-only Transformer architecture to generate
the target text. The postprocessing module corrects the generated text into a
valid result if
needed. Experimental results show that our method outperforms the existing
methods on a public dataset called CPP. We also empirically study the impacts
of different templates of the prompt, different sizes of training data, and
whether to incorporate external knowledge.
☆ Analyzing Public Reactions, Perceptions, and Attitudes during the MPox Outbreak: Findings from Topic Modeling of Tweets
The recent outbreak of the Mpox virus has resulted in a tremendous increase
in the usage of Twitter. Prior works in this area of research have primarily
focused on the sentiment analysis and content analysis of these Tweets, and the
few works that have focused on topic modeling have multiple limitations. This
paper aims to address this research gap and makes two scientific contributions
to this field. First, it presents the results of performing Topic Modeling on
601,432 Tweets about the 2022 Mpox outbreak that were posted on Twitter between
7 May 2022 and 3 March 2023. The results indicate that the conversations on
Twitter related to Mpox during this time range may be broadly categorized into
four distinct themes - Views and Perspectives about Mpox, Updates on Cases and
Investigations about Mpox, Mpox and the LGBTQIA+ Community, and Mpox and
COVID-19. Second, the paper presents the findings from the analysis of these
Tweets. The results show that the theme that was most popular on Twitter (in
terms of the number of Tweets posted) during this time range was Views and
Perspectives about Mpox. This was followed by the theme of Mpox and the
LGBTQIA+ Community, which was followed by the themes of Mpox and COVID-19 and
Updates on Cases and Investigations about Mpox, respectively. Finally, a
comparison with related studies in this area of research is also presented to
highlight the novelty and significance of this research work.
☆ Difficulty-Focused Contrastive Learning for Knowledge Tracing with a Large Language Model-Based Difficulty Prediction
Unggi Lee, Sungjun Yoon, Joon Seo Yun, Kyoungsoo Park, YoungHoon Jung, Damji Stratton, Hyeoncheol Kim
This paper presents novel techniques for enhancing the performance of
knowledge tracing (KT) models by focusing on the crucial factor of question and
concept difficulty level. Despite the acknowledged significance of difficulty,
previous KT research has yet to exploit its potential for model optimization
and has struggled to predict difficulty from unseen data. To address these
problems, we propose a difficulty-centered contrastive learning method for KT
models and a Large Language Model (LLM)-based framework for difficulty
prediction. These innovative methods seek to improve the performance of KT
models and provide accurate difficulty estimates for unseen data. Our ablation
study demonstrates the efficacy of these techniques by showing enhanced
KT model performance. Nonetheless, the complex relationship between language
and difficulty merits further investigation.
comment: 10 pages, 4 figures, 2 tables
☆ ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference AAAI24
Early Exiting is one of the most popular methods to achieve efficient
inference. Current early exiting methods adopt the (weighted) sum of the cross
entropy loss of all internal classifiers during training, requiring all these
classifiers to predict all instances correctly. However, during inference, as
long as one internal classifier predicts an instance correctly, inference can
be accelerated without losing accuracy. Thus, there is a notable gap between
training and inference. We propose ConsistentEE, an early exiting method that
is consistent in training and inference. ConsistentEE formulates the early
exiting process as a reinforcement learning problem. A policy network is added
to decide whether an instance should exit or continue. The training objective
of ConsistentEE only requires each instance to be predicted correctly by one
internal classifier. Additionally, we introduce the concept of the Memorized
Layer to measure the hardness of an instance. We incorporate the memorized layer
into the reward function design, which allows ``easy'' instances to focus more
on acceleration while ``hard'' instances focus more on accuracy. Experimental
results show
that our method outperforms other baselines on various natural language
understanding and generation tasks.
comment: Accepted in AAAI24
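The exit-or-continue decision described in the abstract above can be sketched as a simple inference loop. This is an illustrative sketch only: the function `early_exit_inference`, the fixed `threshold`, and the toy layers, classifiers, and policies are assumptions, and the actual method trains a policy network with reinforcement learning, which is not shown here.

```python
def early_exit_inference(hidden, layers, classifiers, exit_policies, threshold=0.5):
    """Run layers in order and stop at the first layer whose policy says exit.

    Returns (prediction, number_of_layers_used). The last layer always exits.
    """
    for depth, (layer, classify, policy) in enumerate(
        zip(layers, classifiers, exit_policies), start=1
    ):
        hidden = layer(hidden)
        is_last = depth == len(layers)
        # Hypothetical policy: returns an exit probability for this instance.
        if is_last or policy(hidden) >= threshold:
            return classify(hidden), depth
    raise ValueError("at least one layer is required")


# Toy stand-ins: each layer adds 1, each classifier returns the parity.
toy_layers = [lambda h: h + 1] * 3
toy_classifiers = [lambda h: h % 2] * 3
toy_policies = [lambda h: 0.0, lambda h: 0.9, lambda h: 0.0]

early_result = early_exit_inference(0, toy_layers, toy_classifiers, toy_policies)
full_result = early_exit_inference(0, toy_layers, toy_classifiers, [lambda h: 0.0] * 3)
```

In the first call the second policy exceeds the threshold, so only two of the three layers run; in the second call no policy fires and the instance falls through to the final classifier.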
☆ Punctuation restoration Model and Spacing Model for Korean Ancient Document
In Korean ancient documents, there is no spacing or punctuation, and they are
written in classical Chinese characters. This makes it challenging for modern
individuals and translation models to accurately interpret and translate them.
While China has models predicting punctuation and spacing, applying them
directly to Korean texts is problematic due to data differences. Therefore, we
developed the first models which predict punctuation and spacing for Korean
historical texts and evaluated their performance. Our punctuation restoration
model achieved an F1 score of 0.84, and our spacing model achieved a score of 0.96.
It has the advantage of enabling inference on low-performance GPUs with less
VRAM while maintaining quite high accuracy.
comment: 5 Pages, 2 Figures
☆ Sparse is Enough in Fine-tuning Pre-trained Large Language Model
With the prevalence of the pre-training and fine-tuning paradigm, how to efficiently
adapt the pre-trained model to the downstream tasks has been an intriguing
issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for
low-cost adaptation, including Adapters, Bias-only, and the recently widely used
Low-Rank Adaptation. Although these methods have demonstrated their
effectiveness to some extent and have been widely applied, the underlying
principles are still unclear. In this paper, we reveal the transition of loss
landscape in the downstream domain from random initialization to pre-trained
initialization, that is, from low-amplitude oscillation to high-amplitude
oscillation. The parameter gradients exhibit a property akin to sparsity, where
a small fraction of components dominate the total gradient norm, for instance,
1% of the components account for 99% of the gradient. This property ensures
that the pre-trained model can easily find a flat minimizer which guarantees
the model's ability to generalize even with a low number of trainable
parameters. Based on this, we propose a gradient-based sparse fine-tuning
algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its
effectiveness on a range of tasks including the GLUE Benchmark and
Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.
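The gradient-sparsity property mentioned in the abstract above can be made concrete with a small sketch. Both helper functions, the use of the squared L2 norm as the measure, and the toy gradient are assumptions for illustration; the second function is a generic top-k sparse update, not the authors' SIFT algorithm.

```python
def top_fraction_norm_share(gradient, fraction):
    """Share of the squared L2 norm held by the largest-magnitude components."""
    squares = sorted((g * g for g in gradient), reverse=True)
    keep = max(1, round(len(squares) * fraction))
    total = sum(squares)
    return sum(squares[:keep]) / total if total else 0.0


def sparse_sgd_step(params, gradient, fraction, lr):
    """Plain SGD step applied only to the largest-magnitude gradient components."""
    keep = max(1, round(len(gradient) * fraction))
    order = sorted(range(len(gradient)), key=lambda i: abs(gradient[i]), reverse=True)
    updated = list(params)
    for i in order[:keep]:
        updated[i] -= lr * gradient[i]
    return updated


# Hypothetical gradient: one dominant component and 99 small ones.
toy_gradient = [10.0] + [0.1] * 99
share = top_fraction_norm_share(toy_gradient, 0.01)
new_params = sparse_sgd_step([0.0] * 100, toy_gradient, 0.01, lr=0.1)
```

With this toy gradient, the top 1% of components (a single entry) carries just over 99% of the squared norm, and the sparse step changes only that one parameter.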
☆ A Revisit of Fake News Dataset with Augmented Fact-checking by ChatGPT
The proliferation of fake news has emerged as a critical issue in recent
years, requiring significant efforts to detect it. However, the existing fake
news detection datasets are sourced from human journalists, which are likely to
have inherent bias limitations due to the highly subjective nature of this
task. In this paper, we revisit the existing fake news dataset verified by
human journalists with augmented fact-checking by large language models
(ChatGPT), and we name the augmented fake news dataset ChatGPT-FC. We
quantitatively analyze the distinctions and resemblances between human
journalists and the LLM in assessing news subject credibility, news creator
credibility, time sensitivity, and political framing. Our findings highlight
the LLM's potential to serve as a preliminary screening method, offering a
promising avenue to mitigate the inherent biases of human journalists and
enhance fake news detection.
☆ Predicting Human Translation Difficulty with Neural Machine Translation
Human translators linger on some words and phrases more than others, and
predicting this variation is a step towards explaining the underlying cognitive
processes. Using data from the CRITT Translation Process Research Database, we
evaluate the extent to which surprisal and attentional features derived from a
Neural Machine Translation (NMT) model account for reading and production times
of human translators. We find that surprisal and attention are complementary
predictors of translation difficulty, and that surprisal derived from an NMT
model is the single most successful predictor of production duration. Our
analyses draw on data from hundreds of translators operating across 13 language
pairs, and represent the most comprehensive investigation of human translation
difficulty to date.
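For readers unfamiliar with the term used in the abstract above, surprisal is the negative log-probability a model assigns to a token given its context. The sketch below assumes the per-token probabilities are already available; the function name `surprisal_bits` and the probability values are hypothetical and do not come from the paper.

```python
import math


def surprisal_bits(probabilities):
    """Surprisal of each token in bits: -log2 p(token | context)."""
    return [-math.log2(p) for p in probabilities]


# Hypothetical model probabilities for three target tokens.
token_probabilities = [0.5, 0.25, 0.125]
token_surprisal = surprisal_bits(token_probabilities)
```

Less probable tokens receive higher surprisal, which is the quantity the study relates to translators' reading and production times.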
☆ TESS: A Multi-intent Parser for Conversational Multi-Agent Systems with Decentralized Natural Language Understanding Models
Chatbots have become one of the main pathways for the delivery of business
automation tools. Multi-agent systems offer a framework for designing chatbots
at scale, making it easier to support complex conversations that span across
multiple domains as well as enabling developers to maintain and expand their
capabilities incrementally over time. However, multi-agent systems complicate
the natural language understanding (NLU) of user intents, especially when they
rely on decentralized NLU models: some utterances (termed single intent) may
invoke a single agent while others (termed multi-intent) may explicitly invoke
multiple agents. Without correctly parsing multi-intent inputs, decentralized
NLU approaches will not achieve high prediction accuracy. In this paper, we
propose an efficient parsing and orchestration pipeline algorithm to service
multi-intent utterances from the user in the context of a multi-agent system.
Our proposed approach achieved comparable performance to competitive deep
learning models on three different datasets while being up to 48 times faster.
comment: 16 pages
☆ An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training
Youshao Xiao, Weichang Wu, Zhenglei Zhou, Fagui Mao, Shangchun Zhao, Lin Ju, Lei Liang, Xiaolu Zhang, Jun Zhou
Recently, large language models (LLMs) like ChatGPT and InstructGPT have made a
significant impact in the AI world. These models are incredibly versatile,
capable of performing language tasks on par with or even exceeding the
capabilities of human experts. Many works have attempted to reproduce
InstructGPT's complex RLHF (Reinforcement Learning with Human Feedback) training
pipeline. However, the mainstream distributed RLHF training methods typically
adopt a fixed model placement strategy, referred to as the Flattening strategy.
This strategy treats all four models involved in RLHF as a single entity and
places them on all devices, regardless of their differences. Unfortunately,
this strategy exacerbates the generation bottlenecks in the RLHF training and
degrades the overall training efficiency. To address these issues, we propose
an adaptive model placement framework that offers two flexible model placement
strategies. These strategies allow for the agile allocation of models across
devices in a fine-grained manner. The Interleaving strategy helps reduce memory
redundancy and communication costs during RLHF training. On the other hand, the
Separation strategy improves the throughput of model training by separating the
training and generation stages of the RLHF pipeline. Notably, this framework
seamlessly integrates with other mainstream techniques for acceleration and
enables automatic hyperparameter search. Extensive experiments have
demonstrated that our Interleaving and Separation strategies can achieve
notable improvements of up to 11x compared to the current state-of-the-art (SOTA)
approaches. These experiments encompassed a wide range of training scenarios,
involving models of varying sizes and devices of different scales. The results
highlight the effectiveness and superiority of our approaches in accelerating
the training of distributed RLHF.
☆ Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, Alexandre Frechette, Charlotte Smith, Laura Culp, Lev Proleev, Yi Luan, Xi Chen, James Lottes, Nathan Schucher, Federico Lebron, Alban Rrustemi, Natalie Clay, Phil Crone, Tomas Kocisky, Jeffrey Zhao, Bartek Perz, Dian Yu, Heidi Howard, Adam Bloniarz, Jack W. 
Rae, Han Lu, Laurent Sifre, Marcello Maggioni, Fred Alcober, Dan Garrette, Megan Barnes, Shantanu Thakoor, Jacob Austin, Gabriel Barth-Maron, William Wong, Rishabh Joshi, Rahma Chaabouni, Deeni Fatiha, Arun Ahuja, Ruibo Liu, Yunxuan Li, Sarah Cogan, Jeremy Chen, Chao Jia, Chenjie Gu, Qiao Zhang, Jordan Grimstad, Ale Jakse Hartman, Martin Chadwick, Gaurav Singh Tomar, Xavier Garcia, Evan Senter, Emanuel Taropa, Thanumalayan Sankaranarayana Pillai, Jacob Devlin, Michael Laskin, Diego de Las Casas, Dasha Valter, Connie Tao, Lorenzo Blanco, Adrià Puigdomènech Badia, David Reitter, Mianna Chen, Jenny Brennan, Clara Rivera, Sergey Brin, Shariq Iqbal, Gabriela Surita, Jane Labanowski, Abhi Rao, Stephanie Winkler, Emilio Parisotto, Yiming Gu, Kate Olszewska, Yujing Zhang, Ravi Addanki, Antoine Miech, Annie Louis, Laurent El Shafey, Denis Teplyashin, Geoff Brown, Elliot Catt, Nithya Attaluri, Jan Balaguer, Jackie Xiang, Pidong Wang, Zoe Ashwood, Anton Briukhov, Albert Webson, Sanjay Ganapathy, Smit Sanghavi, Ajay Kannan, Ming-Wei Chang, Axel Stjerngren, Josip Djolonga, Yuting Sun, Ankur Bapna, Matthew Aitchison, Pedram Pejman, Henryk Michalewski, Tianhe Yu, Cindy Wang, Juliette Love, Junwhan Ahn, Dawn Bloxwich, Kehang Han, Peter Humphreys, Thibault Sellam, James Bradbury, Varun Godbole, Sina Samangooei, Bogdan Damoc, Alex Kaskasoli, Sébastien M. R. 
Arnold, Vijay Vasudevan, Shubham Agrawal, Jason Riesa, Dmitry Lepikhin, Richard Tanburn, Srivatsan Srinivasan, Hyeontaek Lim, Sarah Hodkinson, Pranav Shyam, Johan Ferret, Steven Hand, Ankush Garg, Tom Le Paine, Jian Li, Yujia Li, Minh Giang, Alexander Neitz, Zaheer Abbas, Sarah York, Machel Reid, Elizabeth Cole, Aakanksha Chowdhery, Dipanjan Das, Dominika Rogozińska, Vitaly Nikolaev, Pablo Sprechmann, Zachary Nado, Lukas Zilka, Flavien Prost, Luheng He, Marianne Monteiro, Gaurav Mishra, Chris Welty, Josh Newlan, Dawei Jia, Miltiadis Allamanis, Clara Huiyi Hu, Raoul de Liedekerke, Justin Gilmer, Carl Saroufim, Shruti Rijhwani, Shaobo Hou, Disha Shrivastava, Anirudh Baddepudi, Alex Goldin, Adnan Ozturel, Albin Cassirer, Yunhan Xu, Daniel Sohn, Devendra Sachan, Reinald Kim Amplayo, Craig Swanson, Dessie Petrova, Shashi Narayan, Arthur Guez, Siddhartha Brahma, Jessica Landon, Miteyan Patel, Ruizhe Zhao, Kevin Villela, Luyu Wang, Wenhao Jia, Matthew Rahtz, Mai Giménez, Legg Yeung, Hanzhao Lin, James Keeling, Petko Georgiev, Diana Mincu, Boxi Wu, Salem Haykal, Rachel Saputro, Kiran Vodrahalli, James Qin, Zeynep Cankara, Abhanshu Sharma, Nick Fernando, Will Hawkins, Behnam Neyshabur, Solomon Kim, Adrian Hutter, Priyanka Agrawal, Alex Castro-Ros, George van den Driessche, Tao Wang, Fan Yang, Shuo-yiin Chang, Paul Komarek, Ross McIlroy, Mario Lučić, Guodong Zhang, Wael Farhan, Michael Sharman, Paul Natsev, Paul Michel, Yong Cheng, Yamini Bansal, Siyuan Qiao, Kris Cao, Siamak Shakeri, Christina Butterfield, Justin Chung, Paul Kishan Rubenstein, Shivani Agrawal, Arthur Mensch, Kedar Soparkar, Karel Lenc, Timothy Chung, Aedan Pope, Loren Maggiore, Jackie Kay, Priya Jhakra, Shibo Wang, Joshua Maynez, Mary Phuong, Taylor Tobin, Andrea Tacchetti, Maja Trebacz, Kevin Robinson, Yash Katariya, Sebastian Riedel, Paige Bailey, Kefan Xiao, Nimesh Ghelani, Lora Aroyo, Ambrose Slone, Neil Houlsby, Xuehan Xiong, Zhen Yang, Elena Gribovskaya, Jonas Adler, Mateo Wirth, Lisa Lee, Music Li, 
Thais Kagohara, Jay Pavagadhi, Sophie Bridgers, Anna Bortsova, Sanjay Ghemawat, Zafarali Ahmed, Tianqi Liu, Richard Powell, Vijay Bolina, Mariko Iinuma, Polina Zablotskaia, James Besley, Da-Woon Chung, Timothy Dozat, Ramona Comanescu, Xiance Si, Jeremy Greer, Guolong Su, Martin Polacek, Raphaël Lopez Kaufman, Simon Tokumine, Hexiang Hu, Elena Buchatskaya, Yingjie Miao, Mohamed Elhawaty, Aditya Siddhant, Nenad Tomasev, Jinwei Xing, Christina Greer, Helen Miller, Shereen Ashraf, Aurko Roy, Zizhao Zhang, Ada Ma, Angelos Filos, Milos Besta, Rory Blevins, Ted Klimenko, Chih-Kuan Yeh, Soravit Changpinyo, Jiaqi Mu, Oscar Chang, Mantas Pajarskas, Carrie Muir, Vered Cohen, Charline Le Lan, Krishna Haridasan, Amit Marathe, Steven Hansen, Sholto Douglas, Rajkumar Samuel, Mingqiu Wang, Sophia Austin, Chang Lan, Jiepu Jiang, Justin Chiu, Jaime Alonso Lorenzo, Lars Lowe Sjösund, Sébastien Cevey, Zach Gleicher, Thi Avrahami, Anudhyan Boral, Hansa Srinivasan, Vittorio Selo, Rhys May, Konstantinos Aisopos, Léonard Hussenot, Livio Baldini Soares, Kate Baumli, Michael B. 
Chang, Adrià Recasens, Ben Caine, Alexander Pritzel, Filip Pavetic, Fabio Pardo, Anita Gergely, Justin Frye, Vinay Ramasesh, Dan Horgan, Kartikeya Badola, Nora Kassner, Subhrajit Roy, Ethan Dyer, Víctor Campos, Alex Tomala, Yunhao Tang, Dalia El Badawy, Elspeth White, Basil Mustafa, Oran Lang, Abhishek Jindal, Sharad Vikram, Zhitao Gong, Sergi Caelles, Ross Hemsley, Gregory Thornton, Fangxiaoyu Feng, Wojciech Stokowiec, Ce Zheng, Phoebe Thacker, Çağlar Ünlü, Zhishuai Zhang, Mohammad Saleh, James Svensson, Max Bileschi, Piyush Patil, Ankesh Anand, Roman Ring, Katerina Tsihlas, Arpi Vezer, Marco Selvi, Toby Shevlane, Mikel Rodriguez, Tom Kwiatkowski, Samira Daruki, Keran Rong, Allan Dafoe, Nicholas FitzGerald, Keren Gu-Lemberg, Mina Khan, Lisa Anne Hendricks, Marie Pellat, Vladimir Feinberg, James Cobon-Kerr, Tara Sainath, Maribeth Rauh, Sayed Hadi Hashemi, Richard Ives, Yana Hasson, YaGuang Li, Eric Noland, Yuan Cao, Nathan Byrd, Le Hou, Qingze Wang, Thibault Sottiaux, Michela Paganini, Jean-Baptiste Lespiau, Alexandre Moufarek, Samer Hassan, Kaushik Shivakumar, Joost van Amersfoort, Amol Mandhane, Pratik Joshi, Anirudh Goyal, Matthew Tung, Andrew Brock, Hannah Sheahan, Vedant Misra, Cheng Li, Nemanja Rakićević, Mostafa Dehghani, Fangyu Liu, Sid Mittal, Junhyuk Oh, Seb Noury, Eren Sezener, Fantine Huot, Matthew Lamm, Nicola De Cao, Charlie Chen, Gamaleldin Elsayed, Ed Chi, Mahdis Mahdieh, Ian Tenney, Nan Hua, Ivan Petrychenko, Patrick Kane, Dylan Scandinaro, Rishub Jain, Jonathan Uesato, Romina Datta, Adam Sadovsky, Oskar Bunyan, Dominik Rabiej, Shimu Wu, John Zhang, Gautam Vasudevan, Edouard Leurent, Mahmoud Alnahlawi, Ionut Georgescu, Nan Wei, Ivy Zheng, Betty Chan, Pam G Rabinovitch, Piotr Stanczyk, Ye Zhang, David Steiner, Subhajit Naskar, Michael Azzam, Matthew Johnson, Adam Paszke, Chung-Cheng Chiu, Jaume Sanchez Elias, Afroz Mohiuddin, Faizan Muhammad, Jin Miao, Andrew Lee, Nino Vieillard, Sahitya Potluri, Jane Park, Elnaz Davoodi, Jiageng Zhang, Jeff 
Stanway, Drew Garmon, Abhijit Karmarkar, Zhe Dong, Jong Lee, Aviral Kumar, Luowei Zhou, Jonathan Evens, William Isaac, Zhe Chen, Johnson Jia, Anselm Levskaya, Zhenkai Zhu, Chris Gorgolewski, Peter Grabowski, Yu Mao, Alberto Magni, Kaisheng Yao, Javier Snaider, Norman Casagrande, Paul Suganthan, Evan Palmer, Geoffrey Irving, Edward Loper, Manaal Faruqui, Isha Arkatkar, Nanxin Chen, Izhak Shafran, Michael Fink, Alfonso Castaño, Irene Giannoumis, Wooyeol Kim, Mikołaj Rybiński, Ashwin Sreevatsa, Jennifer Prendki, David Soergel, Adrian Goedeckemeyer, Willi Gierke, Mohsen Jafari, Meenu Gaba, Jeremy Wiesner, Diana Gage Wright, Yawen Wei, Harsha Vashisht, Yana Kulizhskaya, Jay Hoover, Maigo Le, Lu Li, Chimezie Iwuanyanwu, Lu Liu, Kevin Ramirez, Andrey Khorlin, Albert Cui, Tian LIN, Marin Georgiev, Marcus Wu, Ricardo Aguilar, Keith Pallo, Abhishek Chakladar, Alena Repina, Xihui Wu, Tom van der Weide, Priya Ponnapalli, Caroline Kaplan, Jiri Simsa, Shuangfeng Li, Olivier Dousse, Fan Yang, Jeff Piper, Nathan Ie, Minnie Lui, Rama Pasumarthi, Nathan Lintz, Anitha Vijayakumar, Lam Nguyen Thiet, Daniel Andor, Pedro Valenzuela, Cosmin Paduraru, Daiyi Peng, Katherine Lee, Shuyuan Zhang, Somer Greene, Duc Dung Nguyen, Paula Kurylowicz, Sarmishta Velury, Sebastian Krause, Cassidy Hardin, Lucas Dixon, Lili Janzer, Kiam Choo, Ziqiang Feng, Biao Zhang, Achintya Singhal, Tejasi Latkar, Mingyang Zhang, Quoc Le, Elena Allica Abellan, Dayou Du, Dan McKinnon, Natasha Antropova, Tolga Bolukbasi, Orgad Keller, David Reid, Daniel Finchelstein, Maria Abi Raad, Remi Crocker, Peter Hawkins, Robert Dadashi, Colin Gaffney, Sid Lall, Ken Franko, Egor Filonov, Anna Bulanova, Rémi Leblond, Vikas Yadav, Shirley Chung, Harry Askham, Luis C. 
Cobo, Kelvin Xu, Felix Fischer, Jun Xu, Christina Sorokin, Chris Alberti, Chu-Cheng Lin, Colin Evans, Hao Zhou, Alek Dimitriev, Hannah Forbes, Dylan Banarse, Zora Tung, Jeremiah Liu, Mark Omernick, Colton Bishop, Chintu Kumar, Rachel Sterneck, Ryan Foley, Rohan Jain, Swaroop Mishra, Jiawei Xia, Taylor Bos, Geoffrey Cideron, Ehsan Amid, Francesco Piccinno, Xingyu Wang, Praseem Banzal, Petru Gurita, Hila Noga, Premal Shah, Daniel J. Mankowitz, Alex Polozov, Nate Kushman, Victoria Krakovna, Sasha Brown, MohammadHossein Bateni, Dennis Duan, Vlad Firoiu, Meghana Thotakuri, Tom Natan, Anhad Mohananey, Matthieu Geist, Sidharth Mudgal, Sertan Girgin, Hui Li, Jiayu Ye, Ofir Roval, Reiko Tojo, Michael Kwong, James Lee-Thorp, Christopher Yew, Quan Yuan, Sumit Bagri, Danila Sinopalnikov, Sabela Ramos, John Mellor, Abhishek Sharma, Aliaksei Severyn, Jonathan Lai, Kathy Wu, Heng-Tze Cheng, David Miller, Nicolas Sonnerat, Denis Vnukov, Rory Greig, Jennifer Beattie, Emily Caveness, Libin Bai, Julian Eisenschlos, Alex Korchemniy, Tomy Tsai, Mimi Jasarevic, Weize Kong, Phuong Dao, Zeyu Zheng, Frederick Liu, Fan Yang, Rui Zhu, Mark Geller, Tian Huey Teh, Jason Sanmiya, Evgeny Gladchenko, Nejc Trdin, Andrei Sozanschi, Daniel Toyama, Evan Rosen, Sasan Tavakkol, Linting Xue, Chen Elkind, Oliver Woodman, John Carpenter, George Papamakarios, Rupert Kemp, Sushant Kafle, Tanya Grunina, Rishika Sinha, Alice Talbert, Abhimanyu Goyal, Diane Wu, Denese Owusu-Afriyie, Cosmo Du, Chloe Thornton, Jordi Pont-Tuset, Pradyumna Narayana, Jing Li, Sabaer Fatehi, John Wieting, Omar Ajmeri, Benigno Uria, Tao Zhu, Yeongil Ko, Laura Knight, Amélie Héliou, Ning Niu, Shane Gu, Chenxi Pang, Dustin Tran, Yeqing Li, Nir Levine, Ariel Stolovich, Norbert Kalb, Rebeca Santamaria-Fernandez, Sonam Goenka, Wenny Yustalim, Robin Strudel, Ali Elqursh, Balaji Lakshminarayanan, Charlie Deck, Shyam Upadhyay, Hyo Lee, Mike Dusenberry, Zonglin Li, Xuezhi Wang, Kyle Levin, Raphael Hoffmann, Dan Holtmann-Rice, Olivier Bachem, 
Summer Yue, Sho Arora, Eric Malmi, Daniil Mirylenka, Qijun Tan, Christy Koh, Soheil Hassas Yeganeh, Siim Põder, Steven Zheng, Francesco Pongetti, Mukarram Tariq, Yanhua Sun, Lucian Ionita, Mojtaba Seyedhosseini, Pouya Tafti, Ragha Kotikalapudi, Zhiyu Liu, Anmol Gulati, Jasmine Liu, Xinyu Ye, Bart Chrzaszcz, Lily Wang, Nikhil Sethi, Tianrun Li, Ben Brown, Shreya Singh, Wei Fan, Aaron Parisi, Joe Stanton, Chenkai Kuang, Vinod Koverkathu, Christopher A. Choquette-Choo, Yunjie Li, TJ Lu, Abe Ittycheriah, Prakash Shroff, Pei Sun, Mani Varadarajan, Sanaz Bahargam, Rob Willoughby, David Gaddy, Ishita Dasgupta, Guillaume Desjardins, Marco Cornero, Brona Robenek, Bhavishya Mittal, Ben Albrecht, Ashish Shenoy, Fedor Moiseev, Henrik Jacobsson, Alireza Ghaffarkhah, Morgane Rivière, Alanna Walton, Clément Crepy, Alicia Parrish, Yuan Liu, Zongwei Zhou, Clement Farabet, Carey Radebaugh, Praveen Srinivasan, Claudia van der Salm, Andreas Fidjeland, Salvatore Scellato, Eri Latorre-Chimoto, Hanna Klimczak-Plucińska, David Bridson, Dario de Cesare, Tom Hudson, Piermaria Mendolicchio, Lexi Walker, Alex Morris, Ivo Penchev, Matthew Mauger, Alexey Guseynov, Alison Reid, Seth Odoom, Lucia Loher, Victor Cotruta, Madhavi Yenugula, Dominik Grewe, Anastasia Petrushkina, Tom Duerig, Antonio Sanchez, Steve Yadlowsky, Amy Shen, Amir Globerson, Adam Kurzrok, Lynette Webb, Sahil Dua, Dong Li, Preethi Lahoti, Surya Bhupatiraju, Dan Hurt, Haroon Qureshi, Ananth Agarwal, Tomer Shani, Matan Eyal, Anuj Khare, Shreyas Rammohan Belle, Lei Wang, Chetan Tekur, Mihir Sanjay Kale, Jinliang Wei, Ruoxin Sang, Brennan Saeta, Tyler Liechty, Yi Sun, Yao Zhao, Stephan Lee, Pandu Nayak, Doug Fritz, Manish Reddy Vuyyuru, John Aslanides, Nidhi Vyas, Martin Wicke, Xiao Ma, Taylan Bilal, Evgenii Eltyshev, Daniel Balle, Nina Martin, Hardie Cate, James Manyika, Keyvan Amiri, Yelin Kim, Xi Xiong, Kai Kang, Florian Luisier, Nilesh Tripuraneni, David Madras, Mandy Guo, Austin Waters, Oliver Wang, Joshua Ainslie, Jason 
Baldridge, Han Zhang, Garima Pruthi, Jakob Bauer, Feng Yang, Riham Mansour, Jason Gelman, Yang Xu, George Polovets, Ji Liu, Honglong Cai, Warren Chen, XiangHai Sheng, Emily Xue, Sherjil Ozair, Adams Yu, Christof Angermueller, Xiaowei Li, Weiren Wang, Julia Wiesinger, Emmanouil Koukoumidis, Yuan Tian, Anand Iyer, Madhu Gurumurthy, Mark Goldenson, Parashar Shah, MK Blake, Hongkun Yu, Anthony Urbanowicz, Jennimaria Palomaki, Chrisantha Fernando, Kevin Brooks, Ken Durden, Harsh Mehta, Nikola Momchev, Elahe Rahimtoroghi, Maria Georgaki, Amit Raul, Sebastian Ruder, Morgan Redshaw, Jinhyuk Lee, Komal Jalan, Dinghua Li, Ginger Perng, Blake Hechtman, Parker Schuh, Milad Nasr, Mia Chen, Kieran Milan, Vladimir Mikulik, Trevor Strohman, Juliana Franco, Tim Green, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, Oriol Vinyals
This report introduces a new family of multimodal models, Gemini, that
exhibit remarkable capabilities across image, audio, video, and text
understanding. The Gemini family consists of Ultra, Pro, and Nano sizes,
suitable for applications ranging from complex reasoning tasks to on-device
memory-constrained use-cases. Evaluation on a broad range of benchmarks shows
that our most-capable Gemini Ultra model advances the state of the art in 30 of
32 of these benchmarks - notably being the first model to achieve human-expert
performance on the well-studied exam benchmark MMLU, and improving the state of
the art in every one of the 20 multimodal benchmarks we examined. We believe
that the new capabilities of Gemini models in cross-modal reasoning and
language understanding will enable a wide variety of use cases and we discuss
our approach toward deploying them responsibly to users.
☆ Designing Guiding Principles for NLP for Healthcare: A Case Study of Maternal Health
Objective: An ethical framework for the use of large language models (LLMs)
is urgently needed to shape how natural language processing (NLP) tools are
used for healthcare applications. Drawing directly from the voices of those
most affected, we propose a set of guiding principles for the use of NLP in
healthcare, with examples based on applications in maternal health.
Materials and Methods: We led an interactive session centered on an LLM-based
chatbot demonstration during a full-day workshop with 39 participants, and
additionally surveyed 30 healthcare workers and 30 birthing people about their
values, needs, and perceptions of AI and LLMs. We conducted quantitative and
qualitative analyses of the interactive discussions to consolidate our findings
into a set of guiding principles.
Results: Using the case study of maternal health, we propose nine principles
for ethical use of LLMs, grouped into three categories: (i) contextual
significance, (ii) measurements, and (iii) who/what is valued. We describe
rationales underlying these principles and provide practical advice.
Discussion: Healthcare faces existing challenges including the balance of
power in clinician-patient relationships, systemic health disparities,
historical injustices, and economic constraints. Our principles serve as a
framework for surfacing key considerations when deploying LLMs in medicine, as
well as providing a methodological pattern for other researchers to follow.
Conclusion: This set of principles can serve as a resource to practitioners
working on maternal health and other healthcare fields to emphasize the
importance of technical nuance, historical context, and inclusive design when
developing LLMs for use in clinical settings.
☆ MELO: Enhancing Model Editing with Neuron-Indexed Dynamic LoRA AAAI
Large language models (LLMs) have shown great success in various Natural
Language Processing (NLP) tasks, yet they still need updates after deployment
to fix errors or keep pace with the changing knowledge in the world.
Researchers formulate this problem as Model Editing and have developed various
editors focusing on different axes of editing properties. However, current
editors can hardly support all properties and rely on heavy computational
resources. In this paper, we propose a plug-in Model Editing method based on
neuron-indexed dynamic LoRA (MELO), which alters the behavior of language
models by dynamically activating certain LoRA blocks according to the index
built in an inner vector database. Our method satisfies various editing
properties with high efficiency and can be easily integrated into multiple LLM
backbones. Experimental results show that our proposed MELO achieves
state-of-the-art editing performance on three sequential editing tasks
(document classification, question answering and hallucination correction),
while requiring the fewest trainable parameters and the lowest computational cost.
comment: In Proceedings of The 38th Annual AAAI Conference on Artificial
Intelligence
☆ COOPER: Coordinating Specialized Agents towards a Complex Dialogue Goal AAAI 2024
In recent years, there has been a growing interest in exploring dialogues
with more complex goals, such as negotiation, persuasion, and emotional
support, which go beyond traditional service-focused dialogue systems. Apart
from the requirement for much more sophisticated strategic reasoning and
communication skills, a significant challenge of these tasks lies in the
difficulty of objectively measuring the achievement of their goals in a
quantifiable way, making it difficult for existing research to directly
optimize the dialogue procedure towards them. In our work, we emphasize the
multifaceted nature of complex dialogue goals and argue that it is more
feasible to accomplish them by comprehensively considering and jointly
promoting their different aspects. To this end, we propose a novel dialogue
framework, Cooper, which coordinates multiple specialized agents, each
dedicated to a specific dialogue goal aspect separately, to approach the
complex objective. With this divide-and-conquer approach, we make complex
dialogue goals more approachable and elicit greater intelligence via the
collaboration of individual agents. Experiments on persuasion and emotional
support dialogues demonstrate the superiority of our method over a set of
competitive baselines.
comment: Accepted by AAAI 2024
☆ Zero-Shot Fact-Checking with Semantic Triples and Knowledge Graphs
Despite progress in automated fact-checking, most systems require a
significant amount of labeled training data, which is expensive. In this paper,
we propose a novel zero-shot method, which instead of operating directly on the
claim and evidence sentences, decomposes them into semantic triples augmented
using external knowledge graphs, and uses large language models trained for
natural language inference. This allows it to generalize to adversarial
datasets and domains that supervised models require specific training data for.
Our empirical results show that our approach outperforms previous zero-shot
approaches on FEVER, FEVER-Symmetric, FEVER 2.0, and Climate-FEVER, while being
comparable to or better than supervised models on the adversarial and the
out-of-domain datasets.
☆ Are you talking to ['xem'] or ['x', 'em']? On Tokenization and Addressing Misgendering in LLMs with Pronoun Tokenization Parity
Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Yuval Pinter, Rahul Gupta
A large body of NLP research has documented the ways gender biases manifest
and amplify within large language models (LLMs), though this research has
predominantly operated within a gender binary-centric context. A growing body
of work has identified the harmful limitations of this gender-exclusive
framing; many LLMs cannot correctly and consistently refer to persons outside
the gender binary, especially if they use neopronouns. While data scarcity has
been identified as a possible culprit, the precise mechanisms through which it
influences LLM misgendering remain underexplored. Our work addresses this gap
by studying data scarcity's role in subword tokenization and, consequently, the
formation of LLM word representations. We uncover how the Byte-Pair Encoding
(BPE) tokenizer, a backbone for many popular LLMs, contributes to neopronoun
misgendering through out-of-vocabulary behavior. We introduce pronoun
tokenization parity (PTP), a novel approach to reduce LLM neopronoun
misgendering by preserving a token's functional structure. We evaluate PTP's
efficacy using pronoun consistency-based metrics and a novel syntax-based
metric. Through several controlled experiments, finetuning LLMs with PTP
improves neopronoun consistency from 14.5% to 58.4%, highlighting the
significant role tokenization plays in LLM pronoun consistency.
comment: Accepted to the 2023 NeurIPS Queer in AI workshop
♻ ☆ PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in Poetry Generation AAAI2024
Controllable text generation is a challenging and meaningful field in natural
language generation (NLG). Poetry generation in particular is a typical case
with well-defined and strict conditions on the generated text, which makes it an
ideal playground for assessing current methodologies. While prior works
succeeded in controlling either semantic or metrical aspects of poetry
generation, simultaneously addressing both remains a challenge. In this paper,
we pioneer the use of the Diffusion model for generating sonnets and Chinese
SongCi poetry to tackle such challenges. In terms of semantics, our
PoetryDiffusion model, built upon the Diffusion model, generates entire
sentences or poetry by comprehensively considering the entirety of sentence
information. This approach enhances semantic expression, distinguishing it from
autoregressive and large language models (LLMs). For metrical control, the
separation feature of diffusion generation and its constraint control module
enable us to flexibly incorporate a novel metrical controller to manipulate and
evaluate metrics (format and rhythm). The denoising process in PoetryDiffusion
allows for gradual enhancement of semantics and flexible integration of the
metrical controller which can calculate and impose penalties on states that
stray significantly from the target control distribution. Experimental results
on two datasets demonstrate that our model outperforms existing models in
automatic evaluation of semantic, metrical, and overall performance as well as
human evaluation.
comment: Accepted by AAAI2024
♻ ☆ A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift
Foundation models, specifically Large Language Models (LLMs), have lately
gained widespread attention and adoption. Reinforcement Learning with Human
Feedback (RLHF) involves training a reward model to capture desired behaviors,
which is then used to align LLMs. These reward models are additionally used at
inference-time to estimate LLM responses' adherence to those desired behaviors.
However, there is little work measuring how robust these reward models are to
distribution shifts. In this work, we evaluate how reward model performance -
measured via accuracy and calibration (i.e. alignment between accuracy and
confidence) - is affected by distribution shift. We show novel calibration
patterns and accuracy drops due to OOD prompts and responses, and that the
reward model is more sensitive to shifts in responses than prompts.
Additionally, we adapt an OOD detection technique commonly used in
classification to the reward model setting to detect these distribution shifts
in prompts and responses.
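To make "calibration" concrete, the sketch below computes a binned expected calibration error (ECE) for a reward model scored on preference pairs. The abstract does not state which calibration metric the authors use, so ECE, the function name, and the toy inputs are assumptions chosen for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the bin-size-weighted mean of |accuracy - mean confidence|.

    confidences: reward-model confidence in its preferred response, each in [0, 1].
    correct:     booleans, True when that preference matches the human label.
    Both inputs are hypothetical stand-ins for real reward-model outputs.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy comparison: the same confidences, scored against two sets of labels.
    in_dist = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
    shifted = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, False, False, False])
    print(in_dist, shifted)
```

A larger ECE on shifted prompts or responses than on in-distribution data would be one way to quantify the kind of calibration degradation the abstract describes.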
♻ ☆ Debiasing Multimodal Sarcasm Detection with Contrastive Learning
Despite commendable achievements made by existing work, prevailing multimodal
sarcasm detection studies rely more on textual content than on visual
information. This unavoidably induces spurious correlations between textual
words and labels,
thereby significantly hindering the models' generalization capability. To
address this problem, we define the task of out-of-distribution (OOD)
multimodal sarcasm detection, which aims to evaluate models' generalizability
when the word distribution is different in training and testing settings.
Moreover, we propose a novel debiasing multimodal sarcasm detection framework
with contrastive learning, which aims to mitigate the harmful effect of biased
textual factors for robust OOD generalization. In particular, we first design
counterfactual data augmentation to construct the positive samples with
dissimilar word biases and negative samples with similar word biases.
Subsequently, we devise an adapted debiasing contrastive learning mechanism to
empower the model to learn robust task-relevant features and alleviate the
adverse effect of biased words. Extensive experiments show the superiority of
the proposed framework.
♻ ☆ Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning EMNLP 2023
In-context learning (ICL) emerges as a promising capability of large language
models (LLMs) by providing them with demonstration examples to perform diverse
tasks. However, the underlying mechanism of how LLMs learn from the provided
context remains under-explored. In this paper, we investigate the working
mechanism of ICL through an information flow lens. Our findings reveal that
label words in the demonstration examples function as anchors: (1) semantic
information aggregates into label word representations during the shallow
computation layers' processing; (2) the consolidated information in label words
serves as a reference for LLMs' final predictions. Based on these insights, we
introduce an anchor re-weighting method to improve ICL performance, a
demonstration compression technique to expedite inference, and an analysis
framework for diagnosing ICL errors in GPT2-XL. The promising applications of
our findings again validate the uncovered ICL working mechanism and pave the
way for future studies.
comment: Accepted by EMNLP 2023
♻ ☆ Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering EMNLP 2023
We train a language model (LM) to robustly answer multistep questions by
generating and answering sub-questions. We propose Chain-of-Questions, a
framework that trains a model to generate sub-questions and sub-answers one at
a time by leveraging human annotated question decomposition meaning
representation (QDMR). The key technical challenge is that QDMR only contains
sub-questions but not answers to those sub-questions, so we treat sub-answers
as latent variables and optimize them using a novel dynamic mixture of Hard-EM
and MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods
by 9.0 F1 on DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on HOTPOTQA
adversarial set, thus demonstrating the effectiveness and robustness of our
framework.
comment: Accepted by EMNLP 2023
♻ ☆ Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions? CVPR 2023
Data augmentation via back-translation is common when pretraining
Vision-and-Language Navigation (VLN) models, even though the generated
instructions are noisy. But does that noise matter? We find that pretraining
with nonsensical or irrelevant language instructions can have little effect on
downstream performance for both HAMT and VLN-BERT on R2R, and is still better
than using only clean, human data. To underscore these results, we
concoct an efficient augmentation method, Unigram + Object, which generates
nonsensical instructions that nonetheless improve downstream performance. Our
findings suggest that what matters for VLN R2R pretraining is the quantity of
visual trajectories, not the quality of instructions.
comment: Accepted by O-DRUM @ CVPR 2023
♻ ☆ "Paraphrasing The Original Text" Makes High Accuracy Long-Context QA
Although LLMs continue to iterate and improve, most open-source models still
have a context window of no more than 4k, limiting their ability to handle
long-context problems. Most existing open-source models for long-context chat
still lack satisfactory accuracy. To address this issue, I approach it from the
perspective of training data and theoretically prove that training the
capability to handle long contexts requires "effective" rather than "long"
data. Based on this, I propose using the "original text paraphrase" task, and
successfully extend the context window of the existing model to 32k by a
low-cost and effective method, achieving extremely high accuracy in
multi-document-QA and surpassing all existing open-source models of the same
scale. The model and training data have been open-sourced on
HuggingFace (https://huggingface.co/yuyijiong/Qwen-14b-chat-yarn-32k) and
WiseModel (https://wisemodel.cn/models/yuyijiong/Qwen-14b-chat-yarn-32k).
comment: A Chinese version of this paper can be downloaded from
https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/
♻ ☆ GraphGPT: Graph Instruction Tuning for Large Language Models
Graph Neural Networks (GNNs) have advanced graph structure understanding via
recursive information exchange and aggregation among graph nodes. To improve
model robustness, self-supervised learning (SSL) has emerged as a promising
approach for data augmentation. However, existing methods for generating
pre-trained graph embeddings often rely on fine-tuning with specific downstream
task labels, which limits their usability in scenarios where labeled data is
scarce or unavailable. To address this, our research focuses on advancing the
generalization capabilities of graph models in challenging zero-shot learning
scenarios. Inspired by the success of large language models (LLMs), we aim to
develop a graph-oriented LLM that can achieve high generalization across
diverse downstream datasets and tasks, even without any information available
from the downstream graph data. In this work, we present the GraphGPT framework
that aligns LLMs with graph structural knowledge with a graph instruction
tuning paradigm. Our framework incorporates a text-graph grounding component to
establish a connection between textual information and graph structures.
Additionally, we propose a dual-stage instruction tuning paradigm, accompanied
by a lightweight graph-text alignment projector. This paradigm explores
self-supervised graph structural signals and task-specific graph instructions,
to guide LLMs in understanding complex graph structures and improving their
adaptability across different downstream tasks. Our framework is evaluated on
supervised and zero-shot graph learning tasks, demonstrating superior
generalization and outperforming state-of-the-art baselines.
♻ ☆ Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training ACL 2023
Language tasks involving character-level manipulations (e.g., spelling
corrections, arithmetic operations, word games) are challenging for models
operating on subword units. To address this, we develop a causal intervention
framework to learn robust and interpretable character representations inside
subword-based language models. Our method treats each character as a typed
variable in a causal model and learns such causal structures by adapting the
interchange intervention training method of Geiger et al. (2021). We
additionally introduce a suite of character-level tasks that systematically
vary in their dependence on meaning and sequence-level context. While
character-level models still perform best on purely form-based tasks like
string reversal, our method outperforms character-level models on more complex
tasks that blend form, meaning, and context, such as spelling correction in
context and word search games. Compared with standard subword-based models, our
approach also significantly improves robustness on unseen token sequences and
leads to human-interpretable internal representations of characters.
comment: Findings of the Association for Computational Linguistics: ACL 2023
♻ ☆ VLIS: Unimodal Language Models Guide Multimodal Language Generation EMNLP 2023
Multimodal language generation, which leverages the synergy of language and
vision, is a rapidly expanding field. However, existing vision-language models
face challenges in tasks that require complex linguistic understanding. To
address this issue, we introduce Visual-Language models as Importance Sampling
weights (VLIS), a novel framework that combines the visual conditioning
capability of vision-language models with the language understanding of
unimodal text-only language models without further training. It extracts
pointwise mutual information of each image and text from a visual-language
model and uses the value as an importance sampling weight to adjust the token
likelihood from a text-only model. VLIS improves vision-language models on
diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and
ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning,
and ROCStories). Our results suggest that VLIS represents a promising new
direction for multimodal language generation.
comment: Accepted as main paper in EMNLP 2023
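The re-weighting described in this abstract can be sketched for a single decoding step as follows. This is a simplified reading, not the authors' implementation: the function name, the three toy distributions, and the plain renormalization are assumptions, and any additional constraints the paper applies are omitted.

```python
import math


def vlis_style_rescore(p_text, p_vlm_with_image, p_vlm_without_image):
    """Re-weight a text-only LM's next-token distribution by image-token PMI.

    For each candidate token t:
        PMI(t, image) = log p_vlm(t | image, ctx) - log p_vlm(t | ctx)
        score(t)      = log p_text(t | ctx) + PMI(t, image)
    All three inputs are dicts mapping token -> probability over the same
    candidate set; they are hypothetical stand-ins for real model outputs.
    """
    scores = {}
    for tok, p in p_text.items():
        pmi = math.log(p_vlm_with_image[tok]) - math.log(p_vlm_without_image[tok])
        scores[tok] = math.log(p) + pmi

    # Renormalize with a log-sum-exp shift for numerical stability.
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return {tok: math.exp(s - top) / z for tok, s in scores.items()}


if __name__ == "__main__":
    # Toy step: the text-only model is undecided, the image favours "dog".
    p_text = {"dog": 0.5, "cat": 0.5}
    with_image = {"dog": 0.8, "cat": 0.2}
    without_image = {"dog": 0.5, "cat": 0.5}
    print(vlis_style_rescore(p_text, with_image, without_image))
```

Because the weight is a ratio of two vision-language model probabilities, a token the vision-language model likes regardless of the image gets no boost; only image-specific evidence shifts the text-only model's choice.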
♻ ☆ Graphmax for Text Generation
In text generation, a large language model (LM) chooses each new word based
only on the previously selected words in its context, using the softmax
function. Nevertheless, the co-occurrence statistics of words in a
scene-specific corpus are valuable in choosing the next word, which can help
to ensure that the topic of the generated text is aligned with the current
task. To fully exploit the co-occurrence information, we propose a
graphmax function for task-specific text generation. Using the graph-based
regularization, graphmax enables the final word choice to be determined by both
the global knowledge from the LM and the local knowledge from the
scene-specific corpus. The traditional softmax function is regularized with a
graph total variation (GTV) term, which incorporates the local knowledge into
the LM and encourages the model to consider the statistical relationships
between words in a scene-specific corpus. The proposed graphmax is versatile
and can be readily plugged into any large pre-trained LM for text generation
and machine translation. Through extensive experiments, we demonstrate that the
new GTV-based regularization can improve performance on various natural
language processing tasks in comparison with existing methods. Moreover,
through human experiments, we observe that participants can easily distinguish
text generated by graphmax from text generated by softmax.
♻ ☆ Communicative Agents for Software Development
Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, Maosong Sun
Software engineering is a domain characterized by intricate decision-making
processes, often relying on nuanced intuition and consultation. Recent
advancements in deep learning have started to revolutionize software
engineering practices through elaborate designs implemented at various stages
of software development. In this paper, we present an innovative paradigm that
leverages large language models (LLMs) throughout the entire software
development process, streamlining and unifying key processes through natural
language communication, thereby eliminating the need for specialized models at
each phase. At the core of this paradigm lies ChatDev, a virtual chat-powered
software development company that mirrors the established waterfall model,
meticulously dividing the development process into four distinct chronological
stages: designing, coding, testing, and documenting. Each stage engages a team
of "software agents", such as programmers, code reviewers, and test engineers,
fostering collaborative dialogue and facilitating a seamless workflow. The chat
chain acts as a facilitator, breaking down each stage into atomic subtasks.
This enables dual roles, allowing for proposing and validating solutions
through context-aware communication, leading to efficient resolution of
specific subtasks. The instrumental analysis of ChatDev highlights its
remarkable efficacy in software generation, enabling the completion of the
entire software development process in under seven minutes at a cost of less
than one dollar. It not only identifies and alleviates potential
vulnerabilities but also rectifies potential hallucinations while maintaining
commendable efficiency and cost-effectiveness. The potential of ChatDev unveils
fresh possibilities for integrating LLMs into the realm of software
development. Our code is available at https://github.com/OpenBMB/ChatDev.
comment: https://github.com/OpenBMB/ChatDev
♻ ☆ FP8-LM: Training FP8 Large Language Models
Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng
In this paper, we explore FP8 low-bit data formats for efficient training of
large language models (LLMs). Our key insight is that most variables, such as
gradients and optimizer states, in LLM training can employ low-precision data
formats without compromising model accuracy or requiring changes to
hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision
framework for training LLMs. This framework offers three levels of FP8
utilization to streamline mixed-precision and distributed parallel training for
LLMs. It incrementally incorporates 8-bit gradients, optimizer states, and
distributed learning. Experimental results show that,
during the training of a GPT-175B model on an H100 GPU platform, our FP8
mixed-precision training framework not only achieved a remarkable 39% reduction
in real memory usage but also ran 75% faster than the widely adopted BF16
framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer
Engine by 37%. This largely reduces the training costs for large foundation
models. Furthermore, our FP8 mixed-precision training methodology is generic.
It can be seamlessly applied to other tasks such as LLM instruction tuning and
reinforcement learning with human feedback, offering savings in fine-tuning
expenses. Our FP8 low-precision training framework is open-sourced at
aka.ms/MS.AMP (https://github.com/Azure/MS-AMP).
♻ ☆ Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model AAAI24
Sentence Representation Learning (SRL) is a fundamental task in Natural
Language Processing (NLP), with the Contrastive Learning of Sentence Embeddings
(CSE) being the mainstream technique due to its superior performance. An
intriguing phenomenon in CSE is the significant performance gap between
supervised and unsupervised methods, with their only difference lying in the
training data. Previous works attribute this performance gap to differences in
two representation properties (alignment and uniformity). However, since
alignment and uniformity only measure the results, they fail to answer "What
aspects of the training data contribute to the performance gap?" and "How can
the performance gap be narrowed?" In this paper, we conduct empirical
experiments to answer these "What" and "How" questions. We first answer the
"What" question by thoroughly comparing the behavior of supervised and
unsupervised CSE during their respective training processes. From the
comparison, we identify the similarity pattern as a key factor to the
performance gap, and introduce a metric, called Relative Fitting Difficulty
(RFD), to measure the complexity of the similarity pattern. Then, based on the
insights gained from the "What" question, we tackle the "How" question by
increasing the pattern complexity of the training data. We achieve this by
leveraging the In-Context Learning (ICL) capability of the Large Language Model
(LLM) to generate data that simulates complex patterns. By utilizing the
hierarchical patterns in the LLM-generated data, we effectively narrow the gap
between supervised and unsupervised CSE. We release our code and appendix at
https://github.com/BDBC-KG-NLP/NGCSE.
comment: Accepted at AAAI24
♻ ☆ Recurrent Neural Language Models as Probabilistic Finite-state Automata
Studying language models (LMs) in terms of well-understood formalisms allows
us to precisely characterize their abilities and limitations. Previous work has
investigated the representational capacity of recurrent neural network (RNN)
LMs in terms of their capacity to recognize unweighted formal languages.
However, LMs do not describe unweighted formal languages -- rather, they define
probability distributions over strings. In this work, we study what
classes of such probability distributions RNN LMs can represent, which allows
us to make more direct statements about their capabilities. We show that simple
RNNs are equivalent to a subclass of probabilistic finite-state automata, and
can thus model a strict subset of probability distributions expressible by
finite-state models. Furthermore, we study the space complexity of representing
finite-state LMs with RNNs. We show that, to represent an arbitrary
deterministic finite-state LM with $N$ states over an alphabet $\Sigma$, an
RNN requires $\Omega\left(N |\Sigma|\right)$ neurons. These results present a
first step towards characterizing the classes of distributions RNN LMs can
represent and thus help us understand their capabilities and limitations.
comment: 9 pages
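For readers unfamiliar with the object being studied, here is a minimal sketch of a deterministic probabilistic finite-state LM and how it assigns a probability to a string. The two-state automaton, its probabilities, and the function name are invented for illustration; the sketch does not reproduce the paper's RNN construction or its proof of the lower bound.

```python
def string_probability(transitions, final_prob, string, start=0):
    """Probability of `string` under a deterministic probabilistic automaton.

    transitions[q][a] = (probability of emitting symbol a in state q, next state)
    final_prob[q]     = probability of ending the string in state q
    For each state, the emission and ending probabilities sum to 1.
    """
    state = start
    prob = 1.0
    for symbol in string:
        if symbol not in transitions[state]:
            return 0.0  # no transition on this symbol: zero probability
        step_prob, state = transitions[state][symbol]
        prob *= step_prob
    return prob * final_prob[state]


# Hypothetical automaton with N = 2 states over the alphabet {a, b}.
TRANSITIONS = {
    0: {"a": (0.5, 1), "b": (0.3, 0)},
    1: {"a": (0.1, 0), "b": (0.5, 1)},
}
FINAL_PROB = {0: 0.2, 1: 0.4}

if __name__ == "__main__":
    print(string_probability(TRANSITIONS, FINAL_PROB, "ab"))
    # A fully specified deterministic automaton has one weighted transition
    # per (state, symbol) pair, i.e. N * |Sigma| of them.
    print(sum(len(row) for row in TRANSITIONS.values()))
```

The transition table has one weighted entry per state and symbol, which gives some intuition for why a bound of the form $\Omega(N |\Sigma|)$ on the number of neurons is plausible.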
♻ ☆ Word-Graph2vec: An efficient word embedding approach on word co-occurrence graph using random walk sampling
Word embedding has become ubiquitous and is widely used in various text
mining and natural language processing (NLP) tasks, such as information
retrieval, semantic analysis, and machine translation, among many others.
Unfortunately, it is prohibitively expensive to train the word embedding on a
relatively large corpus. We propose a graph-based word embedding algorithm,
called Word-Graph2vec, which converts the large corpus into a word
co-occurrence graph, then samples word sequences from this graph by random
walks, and finally trains the word embedding on this sampled corpus. We posit
that because of the stable vocabulary, relative idioms, and fixed
expressions in English, the size and density of the word co-occurrence graph
change only slightly as the training corpus grows. As a result,
Word-Graph2vec has a stable runtime on large-scale data sets, and its
performance advantage becomes more and more obvious with the growth of the
training corpus. Extensive experiments conducted on real-world datasets show
that the proposed algorithm outperforms traditional Skip-Gram by four to five
times in terms of efficiency, while the error generated by the random walk
sampling is small.
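The first two stages of the pipeline described above (graph construction and random-walk sampling) can be sketched as follows. The window size, the weighting of walk steps by co-occurrence count, and all function names are assumptions; the abstract only says that sequences are sampled by traveling the graph at random. The final stage, training a Skip-Gram model on the sampled walks, is left out.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(sentences, window=2):
    """Undirected weighted graph: graph[u][v] counts co-occurrences of u and v
    within `window` positions of each other in the same sentence."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                other = tokens[j]
                if other == word:
                    continue  # skip self-loops
                graph[word][other] += 1
                graph[other][word] += 1
    return graph


def random_walk(graph, start, length, rng):
    """Sample a word sequence of up to `length` words, choosing each next word
    with probability proportional to its edge weight (an assumed policy)."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1])
        if not neighbours:
            break  # isolated or unknown word: stop early
        words = list(neighbours)
        weights = [neighbours[w] for w in words]
        walk.append(rng.choices(words, weights=weights, k=1)[0])
    return walk


if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    graph = build_cooccurrence_graph(corpus, window=1)
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    print(random_walk(graph, "the", 5, rng))
```

The number of sampled walks depends on the graph rather than on the raw corpus length, which is consistent with the abstract's claim that runtime stays stable as the corpus grows.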
♻ ☆ Meta-Referential Games to Learn Compositional Learning Behaviours
Human beings use compositionality to generalise from past experiences to
novel experiences. We assume a separation of our experiences into fundamental
atomic components that can be recombined in novel ways to support our ability
to engage with novel experiences. We frame this as the ability to learn to
generalise compositionally, and we will refer to behaviours making use of this
ability as compositional learning behaviours (CLBs). A central problem in
learning CLBs is the resolution of a binding problem (BP). While human beings
perform this feat of intelligence with ease, state-of-the-art artificial
agents do not. Thus, in order to build artificial
agents able to collaborate with human beings, we propose to develop a novel
benchmark to investigate agents' abilities to exhibit CLBs by solving a
domain-agnostic version of the BP. We take inspiration from the language
emergence and grounding framework of referential games and propose a
meta-learning extension of referential games, entitled Meta-Referential Games,
and use this framework to build our benchmark, the Symbolic Behaviour Benchmark
(S2B). We provide baseline results and error analysis showing that our
benchmark is a compelling challenge that we hope will spur the research
community towards developing more capable artificial agents.
comment: work in progress
♻ ☆ Generating Explanations to Understand and Repair Embedding-based Entity Alignment ICDE 2024
Entity alignment (EA) seeks identical entities in different knowledge graphs,
which is a long-standing task in database research. Recent work leverages
deep learning to embed entities in vector space and align them via nearest
neighbor search. Although embedding-based EA has gained marked success in
recent years, it lacks explanations for alignment decisions. In this paper, we
present the first framework that can generate explanations for understanding
and repairing embedding-based EA results. Given an EA pair produced by an
embedding model, we first compare its neighbor entities and relations to build
a matching subgraph as a local explanation. We then construct an alignment
dependency graph to understand the pair from an abstract perspective. Finally,
we repair the pair by resolving three types of alignment conflicts based on
dependency graphs. Experiments on a variety of EA datasets demonstrate the
effectiveness, generalization, and robustness of our framework in explaining
and repairing embedding-based EA results.
comment: Accepted in the 40th IEEE International Conference on Data
Engineering (ICDE 2024)
♻ ☆ SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning
We present SeaEval, a benchmark for multilingual foundation models. In
addition to characterizing how these models understand and reason with natural
language, we also investigate how well they comprehend cultural practices,
nuances, and values. Alongside standard accuracy metrics, we investigate the
brittleness of foundation models in the dimensions of semantics and
multilinguality. Our analyses span both open-sourced and closed models, leading
to empirical results across classic NLP tasks, reasoning, and cultural
comprehension. Key findings indicate (1) Most models exhibit varied behavior
when given paraphrased instructions. (2) Many models still suffer from exposure
bias (e.g., positional bias, majority label bias). (3) For questions rooted in
factual, scientific, and commonsense knowledge, consistent responses are
expected across multilingual queries that are semantically equivalent. Yet,
most models surprisingly demonstrate inconsistent performance on these queries.
(4) Multilingually-trained models have not attained "balanced multilingual"
capabilities. Our endeavors underscore the need for more generalizable semantic
representations and enhanced multilingual contextualization. SeaEval can serve
as a launchpad for more thorough investigations and evaluations for
multilingual and multicultural scenarios.
comment: 20 pages. More datasets (2 on Cross-Lingual Consistency and 4 on
Cultural Understanding) and more supported languages. Code:
https://github.com/SeaEval/SeaEval
♻ ☆ Exploring Transformer Extrapolation AAAI
Length extrapolation has attracted considerable attention recently since it
allows transformers to be tested on longer sequences than those used in
training. Previous research has shown that this property can be attained by
using carefully designed Relative Positional Encodings (RPEs). While these
methods perform well on a variety of corpora, the conditions for length
extrapolation have yet to be investigated. This paper attempts to determine
what types of RPEs allow for length extrapolation through a thorough
mathematical and empirical analysis. We discover that a transformer is certain
to possess this property as long as the series that corresponds to the RPE's
exponential converges. Two practices are derived from the conditions and
examined in language modeling tasks on a variety of corpora. As a bonus from
the conditions, we derive a new Theoretical Receptive Field (TRF) to measure
the receptive field of RPEs without taking any training steps. Extensive
experiments are conducted on the Wikitext-103, Books, Github, and WikiBook
datasets to demonstrate the viability of our discovered conditions. We also
compare TRF to Empirical Receptive Field (ERF) across different models, showing
consistently matched trends on the aforementioned datasets. The code is
available at https://github.com/OpenNLPLab/Rpe.
comment: AAAI Camera Ready. Zhen Qin and Yiran Zhong contribute equally to
this paper; Yiran Zhong is the corresponding author. The code is available at
https://github.com/OpenNLPLab/Rpe
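The convergence condition can be explored numerically. The sketch below treats an RPE as a bias b(i) added to the attention logit at relative distance i and looks at partial sums of exp(b(i)). This is one reading of the abstract's statement, not the paper's exact formulation; the two example biases and all names are assumptions.

```python
import math


def partial_sum(bias, n_terms):
    """Sum of exp(bias(i)) for relative distances i = 1 .. n_terms."""
    return sum(math.exp(bias(i)) for i in range(1, n_terms + 1))


def linear_bias(i, slope=0.5):
    """Linearly decaying bias: exp(-slope * i) forms a convergent geometric series."""
    return -slope * i


def log_bias(i):
    """Logarithmically decaying bias: exp(-log i) = 1 / i, the divergent harmonic series."""
    return -math.log(i)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, partial_sum(linear_bias, n), partial_sum(log_bias, n))
```

Under this reading, the partial sums for the linear bias settle at 1 / (e^0.5 - 1), while those for the logarithmic bias keep growing with n, so only the former would satisfy the stated condition.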
♻ ☆ Split and Rephrase with Large Language Models
The Split and Rephrase task, which consists in splitting complex sentences
into a sequence of shorter grammatical sentences, while preserving the original
meaning, can facilitate the processing of complex texts for humans and machines
alike. In this work, we describe an approach based on large language models,
which improves over the state of the art by large margins on all the major
metrics for the task, on publicly available datasets. We also describe results
from two human evaluations that further establish the significant improvements
obtained with large language models and the viability of the approach. We
evaluate different strategies, including fine-tuning pretrained language models
of varying parameter size, and applying both zero-shot and few-shot in-context
learning on instruction-tuned language models. Although the latter were
markedly outperformed by fine-tuned models, they still achieved promising
results overall. Our results thus demonstrate the strong potential of different
variants of large language models for the Split and Rephrase task, using
relatively small amounts of training samples and model parameters overall.
♻ ☆ GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
With the rapid advancement of large language models (LLMs), there is a
pressing need for a comprehensive evaluation suite to assess their capabilities
and limitations. Existing LLM leaderboards often reference scores reported in
other papers without consistent settings and prompts, which may inadvertently
encourage cherry-picking favored settings and prompts for better results. In
this work, we introduce GPT-Fathom, an open-source and reproducible LLM
evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+
leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across
7 capability categories, all under aligned settings. Our retrospective study on
OpenAI's earlier models offers valuable insights into the evolutionary path
from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3
progressively improves to GPT-4, including technical details like whether
adding code data improves LLMs' reasoning capability, which aspects of LLM
capability can be improved by SFT and RLHF, how large the alignment tax is, etc.
Our analysis sheds light on many of these questions, aiming to improve the
transparency of advanced LLMs.
♻ ☆ Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks
Ling Luo, Jinzhong Ning, Yingwen Zhao, Zhijun Wang, Zeyuan Ding, Peng Chen, Weiru Fu, Qinyu Han, Guangtao Xu, Yunzhi Qiu, Dinghao Pan, Jiru Li, Hao Li, Wenduo Feng, Senbo Tu, Yuqi Liu, Zhihao Yang, Jian Wang, Yuanyuan Sun, Hongfei Lin
Objective: Most existing fine-tuned biomedical large language models (LLMs)
focus on enhancing performance in monolingual biomedical question answering and
conversation tasks. To investigate the effectiveness of the fine-tuned LLMs on
diverse biomedical NLP tasks in different languages, we present Taiyi, a
bilingual fine-tuned LLM for diverse biomedical tasks. Materials and Methods:
We first curated a comprehensive collection of 140 existing biomedical text
mining datasets (102 English and 38 Chinese datasets) across over 10 task
types. Subsequently, a two-stage strategy is proposed for supervised
fine-tuning to optimize the model performance across varied tasks. Results:
Experimental results on 13 test sets covering named entity recognition,
relation extraction, text classification, and question answering tasks demonstrate
that Taiyi achieves superior performance compared to general LLMs. The case
study involving additional biomedical NLP tasks further shows Taiyi's
considerable potential for bilingual biomedical multi-tasking. Conclusion:
Leveraging rich high-quality biomedical corpora and developing effective
fine-tuning strategies can significantly improve the performance of LLMs within
the biomedical domain. Taiyi shows the bilingual multi-tasking capability
through supervised fine-tuning. However, tasks such as information
extraction that are not generative in nature remain challenging for
LLM-based generative approaches, which still underperform the conventional
discriminative approaches of smaller language models.
♻ ☆ ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
In recent years, advancements in large language models have been remarkable,
with models such as ChatGPT demonstrating exceptional proficiency in diverse
linguistic tasks. Pre-training large models with billions of parameters
poses a formidable challenge, primarily due to the scarcity of datasets of a
commensurate scale for effective training. Nevertheless, innovative strategies
have emerged, including methods to fine-tune these pre-trained models using
a smaller set of parameters, as evidenced by models like MiniGPT-4 and LLaVA. Despite
their potential in various domains, these models remain limited in their
understanding of artistic imagery. They have yet to fully grasp the intricate
nuances of art images or to provide an objective articulation of the emotions
they evoke, in a manner akin to human perception. This work introduces
ArtGPT-4, a pioneering large vision-language model tailored to address the
deficiencies of contemporary models in artistic comprehension. ArtGPT-4
underwent training on image-text pairs utilizing a Tesla A100 device in a mere
2 hours, with a dataset comprising approximately 0.52M entries. Impressively,
the model can render images with artistic understanding and convey the
emotions they inspire, mirroring human interpretation. Additionally, this work
presents a unique dataset designed to evaluate the efficacy of vision-language
models. In subsequent evaluations, ArtGPT-4 not only achieved state-of-the-art
performance on the ArtEmis and ArtEmis-v2.0 datasets but also exceeded the
established benchmarks introduced in this study, lagging behind professional
artists' descriptions by a negligible 0.15 points on a 6-point scale. The code
and the pre-trained model are accessible at
https://huggingface.co/Tyrannosaurus/ArtGPT-4.
comment: 20 pages
♻ ☆ Compositional Generalization for Multi-label Text Classification: A Data-Augmentation Approach AAAI'24
Despite significant advancements in multi-label text classification, the
ability of existing models to generalize to novel and seldom-encountered
complex concepts, which are compositions of elementary ones, remains
underexplored. This research addresses this gap. By creating unique data splits
across three benchmarks, we assess the compositional generalization ability of
existing multi-label text classification models. Our results show that these
models often fail to generalize to compositional concepts encountered
infrequently during training, leading to inferior performance on tests with
these new combinations. To address this, we introduce a data augmentation
method that leverages two innovative text generation models designed to enhance
the classification models' capacity for compositional generalization. Our
experiments show that this data augmentation approach significantly improves
the compositional generalization capabilities of classification models on our
benchmarks, with both generation models surpassing other text generation
baselines.
comment: Accepted by AAAI'24
♻ ☆ Understanding the Instruction Mixture for Large Language Model Fine-tuning
While instruction fine-tuning of large language models (LLMs) has been
proven to enhance performance across various applications, the influence of the
instruction dataset mixture on LLMs has not been thoroughly explored. In this
study, we classify instructions into three main types: NLP downstream tasks,
coding, and general chatting, and investigate their impact on LLMs. Our
findings reveal that specific types of instructions are more beneficial for
particular uses, while they may harm other aspects, emphasizing the
importance of meticulously designing the instruction mixture to maximize model
performance. This study sheds light on the instruction mixture and paves the
way for future research.
comment: Instruction Tuning, Large Language Model, Alignment
♻ ☆ The Good, The Bad, and Why: Unveiling Emotions in Generative AI
Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Xinyi Wang, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, Xing Xie
Emotion significantly impacts our daily behaviors and interactions. While
recent generative AI models, such as large language models, have shown
impressive performance in various tasks, it remains unclear whether they truly
comprehend emotions. This paper aims to address this gap by incorporating
psychological theories to gain a holistic understanding of emotions in
generative AI models. Specifically, we propose three approaches: 1)
EmotionPrompt to enhance AI model performance, 2) EmotionAttack to impair AI
model performance, and 3) EmotionDecode to explain the effects of emotional
stimuli, both benign and malignant. Through extensive experiments involving
language and multi-modal models on semantic understanding, logical reasoning,
and generation tasks, we demonstrate that both textual and visual EmotionPrompt
can boost the performance of AI models while EmotionAttack can hinder it.
Additionally, EmotionDecode reveals that AI models can comprehend emotional
stimuli akin to the mechanism of dopamine in the human brain. Our work heralds
a novel avenue for exploring psychology to enhance our understanding of
generative AI models. This paper is an extended version of our previous work
EmotionPrompt (arXiv:2307.11760).
comment: Technical report; an extension to EmotionPrompt (arXiv:2307.11760);
34 pages
♻ ☆ One Shot Learning as Instruction Data Prospector for Large Language Models
Yunshui Li, Binyuan Hui, Xiaobo Xia, Jiaxi Yang, Min Yang, Lei Zhang, Shuzheng Si, Junhao Liu, Tongliang Liu, Fei Huang, Yongbin Li
Aligning large language models (LLMs) with humans is a critical step in
effectively utilizing their pre-trained capabilities across a wide array of
language tasks. Current instruction tuning practices often rely on expanding
dataset size without a clear strategy for ensuring data quality, which can
inadvertently introduce noise and degrade model performance. To address this
challenge, we introduce Nuggets, a novel and efficient methodology that employs
one shot learning to select high-quality instruction data from expansive
datasets. Nuggets assesses the potential of individual instruction examples to
act as effective one shot examples, thereby identifying those that can
significantly enhance diverse task performance. Nuggets utilizes a scoring
system based on the impact of candidate examples on the perplexity of a diverse
anchor set, facilitating the selection of the most beneficial data for
instruction tuning. Through rigorous testing on two benchmarks, MT-Bench and
Alpaca-Eval, we demonstrate that instruction tuning with the top
1% of Nuggets-curated examples substantially outperforms conventional methods
that use the full dataset. These findings advocate for a data selection
paradigm that prioritizes quality, offering a more efficient pathway to align
LLMs with humans.
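A minimal sketch of the perplexity-based selection idea described in the
abstract above. The abstract does not give the exact scoring formula, so the
score used here (the fraction of anchor examples whose perplexity drops when a
candidate is prepended as a one-shot demonstration) is an assumption. The
function names and the hard-coded toy log-probabilities are hypothetical; in
practice the log-probabilities would come from a language model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def one_shot_score(zero_shot_logprobs, one_shot_logprobs):
    """Fraction of anchor examples whose perplexity is lower with the
    candidate prepended as a one-shot demonstration (assumed scoring rule)."""
    improved = sum(
        perplexity(with_demo) < perplexity(without_demo)
        for without_demo, with_demo in zip(zero_shot_logprobs, one_shot_logprobs)
    )
    return improved / len(zero_shot_logprobs)

def select_top(candidates, scores, fraction):
    """Keep the highest-scoring fraction of candidates (at least one)."""
    keep = max(1, int(len(candidates) * fraction))
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:keep]]

# Toy anchor set of three examples: per-token log-probs without any demonstration.
zero_shot = [[-2.0, -1.0], [-0.5, -0.5], [-3.0, -1.0]]
# Log-probs of the same anchor examples with candidate A or B prepended.
with_a = [[-1.0, -1.0], [-0.4, -0.5], [-3.5, -1.0]]  # helps on 2 of 3 anchors
with_b = [[-2.5, -1.0], [-0.6, -0.5], [-2.0, -1.0]]  # helps on 1 of 3 anchors

scores = [one_shot_score(zero_shot, with_a), one_shot_score(zero_shot, with_b)]
selected = select_top(["A", "B"], scores, fraction=0.5)
```

With these toy numbers, candidate A lowers perplexity on more anchor examples
than candidate B, so it is the one retained when half of the pool is kept.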
♻ ☆ How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model
This review paper explores Multimodal Large Language Models (MLLMs), which
integrate Large Language Models (LLMs) like GPT-4 to handle multimodal data
such as text and vision. MLLMs demonstrate capabilities like generating image
narratives and answering image-based questions, bridging the gap towards
real-world human-computer interactions and hinting at a potential pathway to
artificial general intelligence. However, MLLMs still face challenges in
processing the semantic gap in multimodality, which may lead to erroneous
generation, posing potential risks to society. Choosing the appropriate
modality alignment method is crucial, as improper methods might require more
parameters with limited performance improvement. This paper aims to explore
modality alignment methods for LLMs and their existing capabilities.
Implementing modality alignment allows LLMs to address environmental issues and
enhance accessibility. The study classifies existing modality alignment methods in
MLLMs into four groups: (1) Multimodal Converters that change data into
something LLMs can understand; (2) Multimodal Perceivers to improve how LLMs
perceive different types of data; (3) Tools Assistance for changing data into
one common format, usually text; and (4) Data-Driven methods that teach LLMs to
understand specific types of data in a dataset. This field is still in a phase
of exploration and experimentation, and we will organize and update various
existing research methods for multimodal information alignment.
♻ ☆ Addressing Token Uniformity in Transformers via Singular Value Transformation UAI2022
Token uniformity is commonly observed in transformer-based models, in which
different tokens share a large proportion of similar information after going
through multiple stacked self-attention layers in a transformer. In this paper,
we propose to use the distribution of singular values of outputs of each
transformer layer to characterise the phenomenon of token uniformity and
empirically illustrate that a less skewed singular value distribution can
alleviate the 'token uniformity' problem. Based on our observations, we define
several desirable properties of singular value distributions and propose a
novel transformation function for updating the singular values. We show that
apart from alleviating token uniformity, the transformation function should
preserve the local neighbourhood structure in the original embedding space. Our
proposed singular value transformation function is applied to a range of
transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT,
and improved performance is observed in semantic textual similarity evaluation
and a range of GLUE tasks. Our source code is available at
https://github.com/hanqi-qi/tokenUni.git.
comment: UAI2022 Main Conference, Spotlight, combined with supplementary files
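A minimal sketch of how the singular values of a layer's output can signal
token uniformity, as described in the abstract above. The skew measure used
here (the share of total singular-value mass held by the largest singular
value) is an illustrative assumption, not the paper's definition, and the
random matrices stand in for real transformer hidden states. It does not
implement the paper's transformation function.

```python
import numpy as np

def singular_value_skew(hidden_states):
    """Share of the total singular-value mass held by the largest singular
    value of a (tokens x hidden_dim) matrix. Values near 1 mean one direction
    dominates, i.e. the token vectors are close to identical."""
    singular_values = np.linalg.svd(hidden_states, compute_uv=False)
    return float(singular_values[0] / singular_values.sum())

rng = np.random.default_rng(0)

# 16 tokens with independent 8-dimensional representations.
diverse = rng.standard_normal((16, 8))

# 16 tokens that are all small perturbations of one shared vector,
# mimicking representations that have collapsed towards uniformity.
shared = rng.standard_normal((1, 8))
uniform = shared + 0.01 * rng.standard_normal((16, 8))

skew_diverse = singular_value_skew(diverse)
skew_uniform = singular_value_skew(uniform)
```

The near-identical tokens give a much more skewed singular value distribution
than the independent ones, which is the pattern the abstract associates with
token uniformity.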
♻ ☆ Position Bias Mitigation: A Knowledge-Aware Graph Model for Emotion Cause Extraction ACL2021
The Emotion Cause Extraction (ECE) task aims to identify clauses which
contain emotion-evoking information for a particular emotion expressed in text.
We observe that a widely-used ECE dataset exhibits a bias that the majority of
annotated cause clauses are either directly before their associated emotion
clauses or are the emotion clauses themselves. Existing models for ECE tend to
explore such relative position information and suffer from the dataset bias. To
investigate the degree of reliance of existing ECE models on clause relative
positions, we propose a novel strategy to generate adversarial examples in
which the relative position information is no longer the indicative feature of
cause clauses. We test the performance of existing models on such adversarial
examples and observe a significant performance drop. To address the dataset
bias, we propose a novel graph-based method to explicitly model the emotion
triggering paths by leveraging the commonsense knowledge to enhance the
semantic dependencies between a candidate clause and an emotion clause.
Experimental results show that our proposed approach performs on par with the
existing state-of-the-art methods on the original ECE dataset, and is more
robust against adversarial attacks compared to existing models.
comment: ACL2021 Main Conference, Oral paper
♻ ☆ LLMR: Real-time Prompting of Interactive Worlds using Large Language Models
Fernanda De La Torre, Cathy Mengying Fang, Han Huang, Andrzej Banburski-Fahey, Judith Amores Fernandez, Jaron Lanier
We present Large Language Model for Mixed Reality (LLMR), a framework for the
real-time creation and modification of interactive Mixed Reality experiences
using LLMs. LLMR leverages novel strategies to tackle difficult cases where
ideal training data is scarce, or where the design goal requires the synthesis
of internal dynamics, intuitive analysis, or advanced interactivity. Our
framework relies on text interaction and the Unity game engine. By
incorporating techniques for scene understanding, task planning,
self-debugging, and memory management, LLMR outperforms the standard GPT-4 by
4x in average error rate. We demonstrate LLMR's cross-platform interoperability
with several example worlds, and evaluate it on a variety of creation and
modification tasks to show that it can produce and edit diverse objects, tools,
and scenes. Finally, we conducted a usability study (N=11) with a diverse set
of participants, which revealed that they had positive experiences with the
system and would use it again.
comment: 60 pages, 18 figures; Expanded discussion of experiments and the
influence of various modules
♻ ☆ GPT-4 Technical Report
OpenAI, :, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mo Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, 
Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O'Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, 
Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance on
various professional and academic benchmarks, including passing a simulated bar
exam with a score around the top 10% of test takers. GPT-4 is a
Transformer-based model pre-trained to predict the next token in a document.
The post-training alignment process results in improved performance on measures
of factuality and adherence to desired behavior. A core component of this
project was developing infrastructure and optimization methods that behave
predictably across a wide range of scales. This allowed us to accurately
predict some aspects of GPT-4's performance based on models trained with no
more than 1/1,000th the compute of GPT-4.
comment: 100 pages; updated authors list